Abstract

Classifier combination methods need to make best use of the outputs of multiple, imperfect classifiers to enable 
higher accuracy classifications. In many situations, such as when human decisions need to be combined, the 
base decisions can vary enormously in reliability. A Bayesian approach to such uncertain combination allows 
us to infer the differences in performance between individuals and to incorporate any available prior knowledge 
about their abilities when training data is sparse. In this paper we explore Bayesian classifier combination, using 
the computationally efficient framework of variational Bayesian inference. We apply the approach to real data 
from a large citizen science project, Galaxy Zoo Supernovae, and show that our method far outperforms other
established approaches to imperfect decision combination. We go on to analyse the putative community structure 
of the decision makers, based on their inferred decision making strategies, and show that natural groupings are 
formed. Finally we present a dynamic Bayesian classifier combination approach and investigate the changes in 
base classifier performance over time. 

1 Introduction 

In many real-world scenarios we are faced with the need to aggregate information from cohorts of imperfect
decision making agents (base classifiers), be they computational or human. Particularly in the case of human
agents, we rarely have available to us an indication of how decisions were arrived at or a realistic measure of
agent confidence in the various decisions. Fusing multiple sources of information in the presence of uncertainty
is optimally achieved using Bayesian inference, which provides a principled mathematical framework for such
knowledge aggregation. In this paper we provide a Bayesian framework for imperfect decision combination, where
the base classifications we receive are greedy preferences (i.e. labels with no indication of confidence or
uncertainty). The classifier combination method we develop aggregates the decisions of multiple agents, improving
overall performance. We present a principled framework in which the influence of weak decision makers can be
mitigated and in which multiple agents, with very different observations, knowledge or training sets, can be
combined to provide complementary information.

The preliminary application we focus on in this paper is a distributed citizen science project, in which human
agents carry out classification tasks, in this case identifying transient objects from images as corresponding to
potential supernovae or not. This application, Galaxy Zoo Supernovae, is part of the highly successful Zooniverse
family of citizen science projects. In this application the ability of our base classifiers can be very varied and
there is no guarantee of any individual's performance, as each user can have radically different levels of domain
experience and different background knowledge. As individual users are not overloaded with decision requests
by the system, we often have little performance data for individual users (base classifiers). The methodology we
advocate provides a scalable, computationally efficient, Bayesian approach to learning base classifier performance,
thus enabling optimal decision combinations. The approach is robust in the presence of uncertainties at all levels
and naturally handles missing observations, i.e. cases where agents do not provide any base classifications. We
develop extensions to allow for dynamic, sequential inference, through which we track information regarding the
base classifiers. Through the application of social network analysis we can also observe behavioural patterns in the
cohort of base classifiers.

The rest of this paper is organised as follows. In the remainder of the Introduction we briefly describe related
work. In Section 2 we present a probabilistic model for independent Bayesian classifier combination, IBCC.
Section 3 introduces the approximate inference method, variational Bayes, and details its application to IBCC.
Section 4 shows an example application for classifier combination, Galaxy Zoo Supernovae, and compares results
using different classifier combination methods, including IBCC. In Sections 5 and 6 we investigate how communities
of decision makers with similar characteristics can be found using data inferred from Bayesian classifier
combination. Section 7 presents an extension to independent Bayesian classifier combination that models the
changing performance of individual decision makers. Using this extension, Section 8 examines the dynamics of
individuals from our example application, while Sections 9 and 10 show how communities of decision makers change
over time. Finally, Section 11 discusses future directions for this work.



1.1 Related Work 

Previous work has often focused on aggregating expert decisions in fields such as medical diagnosis [2]. In contrast,
crowdsourcing uses novice human agents to perform tasks that would be too difficult or expensive to process
computationally or using experts [3, 4]. The underlying problem of fusing labels from multiple classifications has
been dealt with in various ways and a review of the common methods is given by [5]. The choice of method
typically depends on the type of labels we can obtain from agents (e.g. binary, continuous), whether we can
manipulate agent performance, and whether we can also access input features. Weighted majority and weighted
sum algorithms are popular methods that account for differing reliability in the base classifiers; an analysis of their
performance is given by [6]. Bayesian model combination [7] provides a theoretical basis for soft-selecting from
a space of combination functions. In most cases it outperforms Bayesian model averaging, which relies on one
base classifier matching the data generating model. A well-founded model that learns the combination function
directly was defined by [8], giving a Bayesian treatment to a model first presented in [2]. A similar model was also
investigated by [9] with extensions to learn a classifier from expert labels rather than known ground truth labels.
Both papers assume that base classifiers have constant performance, a problem that we address later in this paper.



2 Independent Bayesian Classifier Combination 

Here we present a variant of Independent Bayesian Classifier Combination (IBCC), originally defined in [8]. The
model assumes conditional independence between base classifiers, but performed as well as more computationally
intense dependency modelling methods also given by [8].

We are given a set of data points indexed from 1 to $N$, where the $i$th data point has a true label $t_i$ that we
wish to infer. We assume $t_i$ is generated from a multinomial distribution with the probabilities of each class
denoted by $\kappa$: $p(t_i = j \mid \kappa) = \kappa_j$. True labels may take values $t_i = 1 \ldots J$, where $J$
is the number of true classes. We assume there are $K$ base classifiers, which produce a set of discrete outputs $c$
with values $l = 1 \ldots L$, where $L$ is the number of possible outputs. The output $c_i^{(k)}$ from classifier $k$
for data point $i$ is assumed to be generated from a multinomial distribution dependent on the true label, with
parameters $\pi_j^{(k)}$: $p(c_i^{(k)} = l \mid t_i = j, \pi_j^{(k)}) = \pi_{jl}^{(k)}$. This model places
minimal requirements on the type of classifier output, which need not be probabilistic and could be selected from an
arbitrary number of discrete values, indicating, for example, greedy preference over a set of class labels. Parameters
$\pi_j^{(k)}$ and $\kappa$ have Dirichlet prior distributions with hyperparameters
$\alpha_{0,j}^{(k)} = [\alpha_{0,j1}^{(k)}, \ldots, \alpha_{0,jL}^{(k)}]$ and $\nu_0 = [\nu_{0,1}, \ldots, \nu_{0,J}]$
respectively. We refer to the set of $\pi_j^{(k)}$ for all base classifiers and all classes as
$\Pi = \{\pi_j^{(k)} \mid j = 1 \ldots J,\ k = 1 \ldots K\}$. Similarly, for the hyperparameters we use
$A_0 = \{\alpha_{0,j}^{(k)} \mid j = 1 \ldots J,\ k = 1 \ldots K\}$.

The joint distribution over all variables for the IBCC model is

$$ p(\kappa, \Pi, t, c \mid A_0, \nu_0) = \prod_{i=1}^{N} \Big\{ \kappa_{t_i} \prod_{k=1}^{K} \pi_{t_i c_i^{(k)}}^{(k)} \Big\}\, p(\kappa \mid \nu_0)\, p(\Pi \mid A_0). \qquad (1) $$
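To make the generative assumptions concrete, the following is a minimal sketch that draws a synthetic data set from the model described above. It is an illustration rather than the authors' code: the function name `sample_ibcc`, the array layout and the use of zero-based class and output indices are all assumptions made here.

```python
import numpy as np

def sample_ibcc(n_points, nu0, alpha0, rng):
    """Draw one synthetic data set from the IBCC generative model.

    nu0:    (J,) Dirichlet hyperparameters for the class proportions kappa.
    alpha0: (K, J, L) Dirichlet hyperparameters, one row per classifier and true class.
    Returns true labels t (n_points,), outputs c (n_points, K), kappa (J,) and pi (K, J, L).
    Classes and outputs are indexed from 0 here (an implementation convenience).
    """
    n_classifiers, n_classes, n_outputs = alpha0.shape
    kappa = rng.dirichlet(nu0)                       # class proportions
    # pi[k, j] is the output distribution of classifier k when the true class is j
    pi = np.array([[rng.dirichlet(alpha0[k, j]) for j in range(n_classes)]
                   for k in range(n_classifiers)])
    t = rng.choice(n_classes, size=n_points, p=kappa)
    c = np.empty((n_points, n_classifiers), dtype=int)
    for i in range(n_points):
        for k in range(n_classifiers):
            c[i, k] = rng.choice(n_outputs, p=pi[k, t[i]])
    return t, c, kappa, pi

rng = np.random.default_rng(0)
alpha0 = np.ones((3, 2, 3))    # 3 classifiers, 2 true classes, 3 possible outputs
nu0 = np.ones(2)
t, c, kappa, pi = sample_ibcc(100, nu0, alpha0, rng)
```

Each row `pi[k, j]` plays the role of one row of a confusion matrix, so inspecting `pi` against `c` and `t` gives a quick feel for how unreliable classifiers distort the observed labels.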

The graphical model for IBCC is shown in Figure 1. A key feature of IBCC is that $\pi^{(k)}$ represents a confusion
matrix that quantifies the decision-making abilities of each base classifier. This potentially allows us to ignore or
retrain poorer classifiers and assign expert decision makers to data points that are highly uncertain. Such efficient
selection of base classifiers is vitally important when there is a cost to obtaining an output from a base classifier, for
example, a financial payment to a decision maker per decision, or when the processing bandwidth of base classifiers
is limited. The confusion matrices $\Pi$ in IBCC also allow us to predict any missing classifier outputs in $c$, so
that we can naturally handle cases where agents provide decisions for only some of the data points.

Figure 1: Graphical Model for IBCC. The shaded node represents observed values, circular nodes are variables with
a distribution and square nodes are variables instantiated with point values.

The IBCC model assumes independence between the rows in $\pi^{(k)}$, i.e. the probability of each classifier's
outputs is dependent on the true label class. In some cases it may be reasonable to assume that performance over
one label class may be correlated with performance in another; indeed methods such as weighted majority [6] make
this tacit assumption. However, we would argue that this is not universally the case, and IBCC makes no such
strong assumptions.

The model here represents a simplification of that proposed in [8], which places exponential hyperpriors over
$A_0$. The exponential distributions are not conjugate to the Dirichlet distributions over $\Pi$, so inference using
Gibbs sampling [10] requires an expensive adaptive rejection sampling step [11] for $A_0$, and the variational
Bayesian solution is intractable. The conjugate prior to the Dirichlet is non-standard and its normalisation constant
is not in closed form [12], so it cannot be used. We therefore alter the model to use point values for $A_0$, as in
other similar models [13, 14, 15]. The hyperparameter values of $A_0$ can hence be chosen to represent any prior
level of uncertainty in the values of the agent-by-agent confusion matrices, $\Pi$, and can be regarded as
pseudo-counts of prior observations, offering a natural method to include any prior knowledge and a methodology to
extend the method to sequential, on-line environments.
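The pseudo-count reading of the hyperparameters can be illustrated with a short sketch. The numbers below are invented for illustration and are not the values used in the experiments; `prior_mean` is a hypothetical helper name.

```python
import numpy as np

# Hypothetical priors for one base classifier with J = 2 true classes (rows)
# and L = 3 possible scores (columns).
weak_prior = np.ones((2, 3))                  # one pseudo-count per outcome, close to uninformative
strong_prior = np.array([[8.0, 1.0, 1.0],     # as if 10 earlier items of class 0 had been seen,
                         [1.0, 1.0, 8.0]])    # 8 of which received the lowest score, and so on

def prior_mean(alpha):
    """Expected confusion matrix when each row has an independent Dirichlet prior."""
    return alpha / alpha.sum(axis=1, keepdims=True)

weak_mean = prior_mean(weak_prior)        # every entry equals 1/3
strong_mean = prior_mean(strong_prior)    # diagonal-heavy: a classifier believed to be reliable
```

Larger pseudo-counts make the prior harder to overturn with data, which is one way to encode confidence in a known expert.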

3 Variational Bayesian IBCC 

The goal of the combination model is to perform inference for the unknown variables $t$, $\Pi$, and $\kappa$. The
inference technique proposed in [9] is maximum a posteriori (MAP) estimation, while [8] suggests a full Bayesian
treatment using Gibbs sampling [10]. While the latter provides some theoretical guarantee of accuracy given the
proposed model, it is often very slow to converge and convergence is difficult to ascertain. In this paper we consider
the use of principled approximate Bayesian methods, namely variational Bayes (VB) [16], as this allows us to replace
non-analytic marginal integrals in the original model with analytic updates in the sufficient statistics of the
variational approximation. This produces a model that iterates rapidly to a solution in a computational framework
which can be seen as a Bayesian generalisation of the Expectation-Maximization (EM) algorithm [17].

3.1 Variational Bayes 

Given a set of observed data $X$ and a set of latent variables and parameters $Z$, the goal of variational Bayes (VB)
is to find a tractable approximation $q(Z)$ to the posterior distribution $p(Z \mid X)$ by minimising the
KL-divergence [18] between the approximate distribution and the true distribution [16, 19]. We can write the log of
the model evidence $p(X)$ as

$$ \ln p(X) = \int q(Z) \ln \frac{p(X, Z)}{q(Z)}\, dZ - \int q(Z) \ln \frac{p(Z \mid X)}{q(Z)}\, dZ = L(q) + \mathrm{KL}(q \,\|\, p). \qquad (2) $$

Since the KL-divergence is non-negative, $L(q)$ is a lower bound on $\ln p(X)$. As $q(Z)$ approaches $p(Z \mid X)$,
the KL-divergence disappears and the lower bound $L(q)$ is maximised. Variational Bayes selects a restricted form of
$q(Z)$ that is tractable to work with, then seeks the distribution within this restricted form that minimises the
KL-divergence. A common restriction is to partition $Z$ into groups of variables, then assume $q(Z)$ factorises into
functions of single groups:

$$ q(Z) = \prod_{i=1}^{M} q_i(Z_i). \qquad (3) $$

For each factor $q_i(Z_i)$ we then seek the optimal solution $q_i^*(Z_i)$ that minimises the KL-divergence. Consider
partitions of variables $Z_i$ and $\bar{Z}_i$, where $\bar{Z}_i = \{Z_j \mid j \neq i,\ j = 1 \ldots M\}$. Mean field
theory [20] shows that we can find an optimal factor $q_i^*(Z_i)$ from the conditional distribution
$p(Z_i \mid X, \bar{Z}_i)$ by taking the expectation over all the other factors $j \mid j \neq i,\ j = 1 \ldots M$.
We can therefore write the log of the optimal factor $\ln q_i^*(Z_i)$ as the expectation with respect to all other
factors of the log of the joint distribution over all variables plus a normalisation constant:

$$ \ln q_i^*(Z_i) = \mathbb{E}_{q(\bar{Z}_i)}[\ln p(X, Z)] + \text{const}. \qquad (4) $$

In our notation, we take the expectation with respect to the variables in the subscript. In this case,
$\mathbb{E}_{q(\bar{Z}_i)}[\ldots]$ indicates that we take an expectation with respect to all factors except
$q(Z_i)$. This expectation is implicitly conditioned on the observed data, $X$, which we omit from the notation for
brevity.

We can evaluate these optimal factors iteratively by first initialising all factors, then updating each in turn using
the expectations with respect to the current values of the other factors. Unlike Gibbs sampling, each iteration is
guaranteed not to decrease the lower bound on the log evidence, $L(q)$, converging to a (local) maximum in a similar
fashion to standard EM algorithms. If the factors $q_i^*(Z_i)$ are exponential family distributions [14], as is the
case for the IBCC method we present in the next section, the lower bound is convex with respect to each factor
$q_i^*(Z_i)$ and $L(q)$ will converge to a global maximum of our approximate, factorised distribution. In practice,
once the optimal factors $q_i^*(Z_i)$ have converged to within a given tolerance, we can approximate the distribution
of the unknown variables and calculate their expected values.
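The relationship between the log evidence, the lower bound and the KL-divergence can be checked numerically on a toy model. The sketch below uses a single discrete latent variable with three states and made-up joint probabilities; with $\mathrm{KL}(q \,\|\, p) = \sum_z q(z) \ln \frac{q(z)}{p(z \mid x)} \geq 0$, the lower bound plus the divergence recovers $\ln p(x)$ for any valid $q$.

```python
import numpy as np

joint = np.array([0.10, 0.25, 0.05])   # p(X = x, Z = z) for three values of z (made-up numbers)
evidence = joint.sum()                 # p(X = x)
posterior = joint / evidence           # p(Z | X = x)

def lower_bound(q):
    """L(q) = sum_z q(z) ln [p(x, z) / q(z)]."""
    return np.sum(q * np.log(joint / q))

def kl_divergence(q):
    """KL(q || p) = sum_z q(z) ln [q(z) / p(z | x)], which is never negative."""
    return np.sum(q * np.log(q / posterior))

q = np.array([0.2, 0.5, 0.3])          # an arbitrary approximating distribution
gap = np.log(evidence) - (lower_bound(q) + kl_divergence(q))   # zero up to rounding
```

Setting `q` equal to `posterior` drives the divergence to zero, at which point the lower bound equals the log evidence.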

3.2 Variational Equations for IBCC 

To provide a variational Bayesian treatment of IBCC, VB-IBCC, we first propose the form for our variational
distribution, $q(Z)$, that factorises between the parameters and latent variables.

$$ q(\kappa, t, \Pi) = q(t)\, q(\kappa, \Pi) \qquad (5) $$

This is the only assumption we must make to perform VB on this model; the forms of the factors arise from our
model of IBCC. We can use the joint distribution in Equation (1) to find the optimal factors $q^*(t)$ and
$q^*(\kappa, \Pi)$ in the form given by Equation (4). For the true labels we have

$$ \ln q^*(t) = \mathbb{E}_{\kappa, \Pi}[\ln p(\kappa, t, \Pi, c)] + \text{const}. \qquad (6) $$



We can rewrite this into factors corresponding to independent data points, with any terms not involving $t_i$ being
absorbed into the normalisation constant. To do this we define $\rho_{ij}$ as

$$ \ln \rho_{ij} = \mathbb{E}_{\kappa}[\ln \kappa_j] + \sum_{k=1}^{K} \mathbb{E}_{\pi_j^{(k)}}\big[\ln \pi_{j c_i^{(k)}}^{(k)}\big], \qquad (7) $$

then we can estimate the probability of a true label, which also gives its expected value:

$$ q^*(t_i = j) = \mathbb{E}_t[t_i = j] = \frac{\rho_{ij}}{\sum_{\iota=1}^{J} \rho_{i\iota}}. \qquad (8) $$

To simplify the optimal factors in subsequent equations, we define expectations with respect to $t$ of the number of
occurrences of each true class, given by

$$ N_j = \sum_{i=1}^{N} \mathbb{E}_t[t_i = j], \qquad (9) $$

and the counts of each classifier decision $c_i^{(k)} = l$ given the true label $t_i = j$, by

$$ N_{jl}^{(k)} = \sum_{i=1}^{N} \delta_{c_i^{(k)} l}\, \mathbb{E}_t[t_i = j], \qquad (10) $$

where $\delta_{c_i^{(k)} l}$ is unity if $c_i^{(k)} = l$ and zero otherwise.
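The label update and the expected counts can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the authors' code: the function name `e_step`, zero-based class and output indices, and the use of -1 to mark a missing classifier output are all choices made here.

```python
import numpy as np

def e_step(c, e_ln_kappa, e_ln_pi):
    """Update q(t) and the expected counts, following Equations (7) to (10).

    c:          (N, K) integer outputs in 0..L-1; -1 marks a missing output (assumed convention).
    e_ln_kappa: (J,) current E[ln kappa_j].
    e_ln_pi:    (K, J, L) current E[ln pi_jl^(k)].
    Returns E[t_i = j] as an (N, J) array, N_j as (J,) and N_jl^(k) as (K, J, L).
    """
    n_points, n_classifiers = c.shape
    n_classes, n_outputs = e_ln_pi.shape[1:]
    ln_rho = np.tile(e_ln_kappa, (n_points, 1))            # Equation (7), first term
    for k in range(n_classifiers):
        seen = c[:, k] >= 0                                # skip missing outputs
        ln_rho[seen] += e_ln_pi[k][:, c[seen, k]].T        # add E[ln pi_{j, c_i^(k)}]
    ln_rho -= ln_rho.max(axis=1, keepdims=True)            # stabilise before exponentiating
    e_t = np.exp(ln_rho)
    e_t /= e_t.sum(axis=1, keepdims=True)                  # Equation (8)
    n_j = e_t.sum(axis=0)                                  # Equation (9)
    n_jl = np.zeros((n_classifiers, n_classes, n_outputs))
    for k in range(n_classifiers):
        for l in range(n_outputs):
            n_jl[k, :, l] = e_t[c[:, k] == l].sum(axis=0)  # Equation (10)
    return e_t, n_j, n_jl
```

Working with $\ln \rho_{ij}$ and subtracting the row maximum before exponentiating avoids underflow when many classifiers have labelled the same data point.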

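For the parameters of the model we have the optimal factors given by:

$$ \ln q^*(\kappa, \Pi) = \mathbb{E}_t[\ln p(\kappa, t, \Pi, c)] + \text{const} \qquad (11) $$

$$ = \mathbb{E}_t\Big[\sum_{i=1}^{N} \ln \kappa_{t_i} + \sum_{i=1}^{N} \sum_{k=1}^{K} \ln \pi_{t_i c_i^{(k)}}^{(k)}\Big] + \ln p(\kappa \mid \nu_0) + \ln p(\Pi \mid A_0) + \text{const}. \qquad (12) $$

In Equation (11) terms involving $\kappa$ and terms involving each confusion matrix in $\Pi$ are separate, so we can
factorise $q^*(\kappa, \Pi)$ further into

$$ q^*(\kappa, \Pi) = q^*(\kappa) \prod_{k=1}^{K} \prod_{j=1}^{J} q^*\big(\pi_j^{(k)}\big). \qquad (13) $$

In the IBCC model (Section 2) we assumed a Dirichlet prior for $\kappa$, which gives us the optimal factor

$$ \ln q^*(\kappa) = \mathbb{E}_t\Big[\sum_{i=1}^{N} \ln \kappa_{t_i}\Big] + \ln p(\kappa \mid \nu_0) + \text{const} \qquad (14) $$

$$ = \sum_{j=1}^{J} N_j \ln \kappa_j + \sum_{j=1}^{J} (\nu_{0,j} - 1) \ln \kappa_j + \text{const}. \qquad (15) $$

Taking the exponential of both sides, we obtain a posterior Dirichlet distribution of the form

$$ q^*(\kappa) \propto \mathrm{Dir}(\kappa \mid \nu_1, \ldots, \nu_J), \qquad (16) $$

where $\nu$ is updated in the standard manner by adding the data counts to the prior counts $\nu_0$:

$$ \nu_j = \nu_{0,j} + N_j. \qquad (17) $$

The expectation of $\ln \kappa$ required to update Equation (7) is therefore:

$$ \mathbb{E}[\ln \kappa_j] = \Psi(\nu_j) - \Psi\Big(\sum_{\iota=1}^{J} \nu_\iota\Big), \qquad (18) $$

where $\Psi$ is the standard digamma function [21].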
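The digamma expression for the expected log of a Dirichlet-distributed vector is straightforward to compute with SciPy. The sketch below is illustrative: `expected_log_dirichlet` is a hypothetical helper name and the counts are invented. Because it operates along the last axis, the same helper applies unchanged to an array of confusion-matrix rows.

```python
import numpy as np
from scipy.special import digamma

def expected_log_dirichlet(params):
    """E[ln x] for x ~ Dirichlet(params), taken along the last axis."""
    params = np.asarray(params, dtype=float)
    return digamma(params) - digamma(params.sum(axis=-1, keepdims=True))

nu0 = np.array([1.0, 1.0])      # prior pseudo-counts (illustrative values)
n_j = np.array([3.0, 7.0])      # expected class counts, as produced by Equation (9)
nu = nu0 + n_j                  # Equation (17)
e_ln_kappa = expected_log_dirichlet(nu)
```

Note that `np.exp(e_ln_kappa)` does not sum to one; by Jensen's inequality the exponentiated expected logs are smaller than the expected proportions.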



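For the confusion matrices $\pi_j^{(k)}$ the priors are also Dirichlet distributions, giving us the factor

$$ \ln q^*\big(\pi_j^{(k)}\big) = \sum_{i=1}^{N} \mathbb{E}_{t_i}[t_i = j] \ln \pi_{j c_i^{(k)}}^{(k)} + \ln p\big(\pi_j^{(k)} \mid \alpha_{0,j}^{(k)}\big) + \text{const} \qquad (19) $$

$$ = \sum_{l=1}^{L} N_{jl}^{(k)} \ln \pi_{jl}^{(k)} + \sum_{l=1}^{L} \big(\alpha_{0,jl}^{(k)} - 1\big) \ln \pi_{jl}^{(k)} + \text{const}. \qquad (20) $$

Again, taking the exponential gives a posterior Dirichlet distribution of the form

$$ q^*\big(\pi_j^{(k)}\big) = \mathrm{Dir}\big(\pi_j^{(k)} \mid \alpha_{j1}^{(k)}, \ldots, \alpha_{jL}^{(k)}\big), \qquad (21) $$

where $\alpha_j^{(k)}$ is updated by adding data counts to prior counts $\alpha_{0,j}^{(k)}$:

$$ \alpha_{jl}^{(k)} = \alpha_{0,jl}^{(k)} + N_{jl}^{(k)}. \qquad (22) $$

The expectation required for Equation (7) is given by

$$ \mathbb{E}\big[\ln \pi_{jl}^{(k)}\big] = \Psi\big(\alpha_{jl}^{(k)}\big) - \Psi\Big(\sum_{m=1}^{L} \alpha_{jm}^{(k)}\Big). \qquad (23) $$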

To apply the VB algorithm to IBCC, we first choose an initial value for all variables
$\mathbb{E}[\ln \pi_{jl}^{(k)}]$ and $\mathbb{E}[\ln \kappa_j]$, either randomly or by taking the expectations of the
variables over their prior distributions (if we have enough domain knowledge to set informative priors). We then
iterate over a two-stage procedure similar to the Expectation-Maximization (EM) algorithm. In the variational
equivalent of the E-step we use the current expected parameters, $\mathbb{E}[\ln \pi_{jl}^{(k)}]$ and
$\mathbb{E}[\ln \kappa_j]$, to update the variational distribution in Equation (5). First we evaluate Equation (8),
then use the result to update the counts $N_j$ and $N_{jl}^{(k)}$ according to Equations (9) and (10). In the
variational M-step, we update $\mathbb{E}[\ln \pi_{jl}^{(k)}]$ and $\mathbb{E}[\ln \kappa_j]$ using Equations (18)
and (23).
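Putting the two stages together, the iteration can be sketched as below. This is an illustration under several assumptions rather than the authors' implementation: the names `vb_ibcc` and `expected_log`, the -1 marker for missing outputs, zero-based indices, and the clamping of any known training labels are all choices made here. Convergence is detected from the change in the expected labels. Note that with fully symmetric priors and no known labels the updates stay at the uniform solution, so the example uses priors that favour the diagonal of each confusion matrix.

```python
import numpy as np
from scipy.special import digamma

def expected_log(params):
    """E[ln x] under a Dirichlet, along the last axis (Equations 18 and 23)."""
    return digamma(params) - digamma(params.sum(axis=-1, keepdims=True))

def vb_ibcc(c, n_classes, n_outputs, alpha0, nu0, known=None, tol=1e-6, max_iter=200):
    """Sketch of VB-IBCC.

    c:      (N, K) integer outputs in 0..L-1; -1 marks a missing output (assumed convention).
    alpha0: (K, J, L) prior pseudo-counts; nu0: (J,) prior pseudo-counts.
    known:  optional (N,) integer array of training labels, -1 where the label is unknown.
    Returns E[t] as (N, J), posterior alpha (K, J, L) and posterior nu (J,).
    """
    n_points, n_classifiers = c.shape
    # One-hot encoding of the outputs; entries for missing outputs stay all-zero.
    onehot = np.zeros((n_points, n_classifiers, n_outputs))
    i_idx, k_idx = np.nonzero(c >= 0)
    onehot[i_idx, k_idx, c[i_idx, k_idx]] = 1.0

    # Initialise the expected log parameters from the priors.
    e_ln_kappa, e_ln_pi = expected_log(nu0), expected_log(alpha0)
    e_t = np.full((n_points, n_classes), 1.0 / n_classes)
    for _ in range(max_iter):
        # Variational E-step: Equations (7) and (8).
        ln_rho = e_ln_kappa + np.einsum('ikl,kjl->ij', onehot, e_ln_pi)
        ln_rho -= ln_rho.max(axis=1, keepdims=True)
        new_e_t = np.exp(ln_rho)
        new_e_t /= new_e_t.sum(axis=1, keepdims=True)
        if known is not None:                      # clamp training labels (an assumption)
            train = known >= 0
            new_e_t[train] = np.eye(n_classes)[known[train]]
        # Expected counts: Equations (9) and (10).
        n_j = new_e_t.sum(axis=0)
        n_jl = np.einsum('ij,ikl->kjl', new_e_t, onehot)
        # Variational M-step: Equations (17), (18), (22) and (23).
        nu, alpha = nu0 + n_j, alpha0 + n_jl
        e_ln_kappa, e_ln_pi = expected_log(nu), expected_log(alpha)
        converged = np.abs(new_e_t - e_t).max() < tol
        e_t = new_e_t
        if converged:
            break
    return e_t, alpha, nu

# Toy usage: three classifiers, two classes, two possible outputs.
c_toy = np.array([[0, 0, 0], [0, 0, 0], [0, 0, -1], [1, 1, 1], [1, 1, 1], [-1, 1, 1]])
alpha0_toy = np.tile(np.array([[2.0, 1.0], [1.0, 2.0]]), (3, 1, 1))
e_t_toy, alpha_toy, nu_toy = vb_ibcc(c_toy, 2, 2, alpha0_toy, np.ones(2))
```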
3.3 Variational Lower Bound 

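To check for convergence we can also calculate the lower bound $L(q)$ (see Equation (2)), which should always
increase after a pair of E-step and M-step updates. While we could alternatively detect convergence of the
expectation over the latent variables, the variational lower bound is a useful sanity check for our derivation of
VB-IBCC and for its implementations.

$$ \begin{aligned}
L(q) &= \iiint q(t, \Pi, \kappa) \ln \frac{p(c, t, \Pi, \kappa \mid A_0, \nu_0)}{q(t, \Pi, \kappa)}\, dt\, d\Pi\, d\kappa \\
&= \mathbb{E}_{t,\Pi,\kappa}[\ln p(c, t, \Pi, \kappa \mid A_0, \nu_0)] - \mathbb{E}_{t,\Pi,\kappa}[\ln q(t, \Pi, \kappa)] \\
&= \mathbb{E}_{t,\Pi}[\ln p(c \mid t, \Pi)] + \mathbb{E}_{t,\kappa}[\ln p(t \mid \kappa)] + \mathbb{E}_{\Pi}[\ln p(\Pi \mid A_0)] + \mathbb{E}_{\kappa}[\ln p(\kappa \mid \nu_0)] \\
&\quad - \mathbb{E}_{t,\Pi,\kappa}[\ln q(t)] - \mathbb{E}_{t,\Pi}[\ln q(\Pi)] - \mathbb{E}_{t,\kappa}[\ln q(\kappa)] \qquad (24)
\end{aligned} $$

The expectation terms relating to the joint probability of the latent variables, observed variables and the
parameters are

$$ \mathbb{E}_{t,\Pi}[\ln p(c \mid t, \Pi)] = \sum_{i=1}^{N} \sum_{k=1}^{K} \sum_{j=1}^{J} \mathbb{E}[t_i = j]\, \mathbb{E}\big[\ln \pi_{j c_i^{(k)}}^{(k)}\big] \qquad (25) $$

$$ \mathbb{E}_{t,\kappa}[\ln p(t \mid \kappa)] = \sum_{i=1}^{N} \sum_{j=1}^{J} \mathbb{E}[t_i = j]\, \mathbb{E}[\ln \kappa_j] \qquad (26) $$

$$ \mathbb{E}_{\Pi}[\ln p(\Pi \mid A_0)] = \sum_{k=1}^{K} \sum_{j=1}^{J} \Big\{ -\ln B\big(\alpha_{0,j}^{(k)}\big) + \sum_{l=1}^{L} \big(\alpha_{0,jl}^{(k)} - 1\big)\, \mathbb{E}\big[\ln \pi_{jl}^{(k)}\big] \Big\} \qquad (27) $$

$$ \mathbb{E}_{\kappa}[\ln p(\kappa \mid \nu_0)] = -\ln B(\nu_0) + \sum_{j=1}^{J} (\nu_{0,j} - 1)\, \mathbb{E}[\ln \kappa_j] \qquad (28) $$

where $B(a) = \prod_{l=1}^{L} \Gamma(a_l) \big/ \Gamma\big(\sum_{l=1}^{L} a_l\big)$ is the Beta function and $\Gamma(a)$
is the Gamma function [21]. Terms in the lower bound relating to the expectation of the variational distributions $q$
are

$$ \mathbb{E}_{t,\Pi,\kappa}[\ln q(t)] = \sum_{i=1}^{N} \sum_{j=1}^{J} \mathbb{E}[t_i = j] \ln \mathbb{E}[t_i = j] \qquad (29) $$

$$ \mathbb{E}_{t,\Pi}[\ln q(\Pi)] = \sum_{k=1}^{K} \sum_{j=1}^{J} \Big\{ -\ln B\big(\alpha_{j}^{(k)}\big) + \sum_{l=1}^{L} \big(\alpha_{jl}^{(k)} - 1\big)\, \mathbb{E}\big[\ln \pi_{jl}^{(k)}\big] \Big\} \qquad (30) $$

$$ \mathbb{E}_{t,\kappa}[\ln q(\kappa)] = -\ln B(N + \nu_0) + \sum_{j=1}^{J} (N_j + \nu_{0,j} - 1)\, \mathbb{E}[\ln \kappa_j] \qquad (31) $$

where $N = [N_1, \ldots, N_J]$ is a vector of counts for each true class.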

Using these equations, the lower bound can be calculated after each pair of E-step and M-step updates. Once the
value of the lower bound stops increasing, the algorithm has converged to the optimal approximate solution.
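The lower bound can be assembled from the expected labels and expected counts. The sketch below is one possible arrangement of the terms in Equations (24) to (31), not the authors' code; it assumes the Dirichlet factors have just been updated from the supplied counts, and the helper names (`ln_beta`, `expected_log`, `dirichlet_term`, `lower_bound`) are hypothetical.

```python
import numpy as np
from scipy.special import digamma, gammaln, xlogy

def ln_beta(a):
    """Log of the multivariate Beta function along the last axis."""
    return gammaln(a).sum(axis=-1) - gammaln(a.sum(axis=-1))

def expected_log(a):
    """E[ln x] under Dirichlet(a), along the last axis."""
    return digamma(a) - digamma(a.sum(axis=-1, keepdims=True))

def dirichlet_term(a, e_ln_x):
    """E[ln Dir(x | a)] given E[ln x]; shared by the prior and posterior Dirichlet terms."""
    return np.sum(-ln_beta(a)) + np.sum((a - 1.0) * e_ln_x)

def lower_bound(e_t, n_jl, alpha0, nu0):
    """L(q) of Equation (24), with q(Pi) and q(kappa) freshly updated from the counts.

    e_t: (N, J) expected labels; n_jl: (K, J, L) expected counts from Equation (10).
    """
    n_j = e_t.sum(axis=0)                                 # Equation (9)
    alpha, nu = alpha0 + n_jl, nu0 + n_j                  # Equations (22) and (17)
    e_ln_pi, e_ln_kappa = expected_log(alpha), expected_log(nu)
    e_ln_p = (np.sum(n_jl * e_ln_pi)                      # Equation (25), summed via the counts
              + np.sum(n_j * e_ln_kappa)                  # Equation (26)
              + dirichlet_term(alpha0, e_ln_pi)           # Equation (27)
              + dirichlet_term(nu0, e_ln_kappa))          # Equation (28)
    e_ln_q = (np.sum(xlogy(e_t, e_t))                     # Equation (29); xlogy treats 0 ln 0 as 0
              + dirichlet_term(alpha, e_ln_pi)            # Equation (30)
              + dirichlet_term(nu, e_ln_kappa))           # Equation (31)
    return e_ln_p - e_ln_q
```

Because the classifier outputs are discrete, the log evidence cannot exceed zero, so a positive value returned by such a function would indicate an implementation error.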



4 Galaxy Zoo Supernovae 

We tested the model using a dataset obtained from the Galaxy Zoo Supernovae citizen science project [1]. The
aim of the project is to classify candidate supernova images as either "supernova" or "not supernova". The dataset
contains scores given by individual volunteer citizen scientists (base classifiers) to candidates after answering a
series of questions. Users answer a set of three linked questions, and the answers are hard-coded in the project
repository to scores of -1, 1 or 3, corresponding respectively to decisions that the data point is very unlikely to be
a supernova, possibly a supernova and very likely a supernova. These scores are our base classifier outputs $c$.

Figure 2: Galaxy Zoo Supernovae: ROC curves and AUCs with 5-fold cross validation. (a) Receiver operating
characteristic (ROC) curves.

In order to verify the efficacy of our approach and competing methods, we use "true" target classifications
obtained from full spectroscopic analysis, undertaken as part of the Palomar Transient Factory collaboration [22].
We note that this information is not available to the base classifiers (the users), being obtained retrospectively. We
compare IBCC using both variational Bayes (VB-IBCC) and Gibbs sampling (Gibbs-IBCC), using as output the
expected values of $t_i$. We also tested simple majority voting, weighted majority voting and weighted sum [6], and
mean user scores, which Galaxy Zoo Supernovae currently uses to filter results. For majority voting methods
we treat both 1 and 3 as a vote for the supernova class.
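The two simplest baselines, mean score and simple majority voting, can be sketched as follows. The score table is invented for illustration, and representing an unseen candidate with `np.nan` is an assumption made here.

```python
import numpy as np

# Hypothetical score table: rows are candidates, columns are volunteers,
# np.nan where a volunteer did not score the candidate.
scores = np.array([[-1.0, -1.0, 1.0],
                   [3.0, 1.0, np.nan],
                   [-1.0, 3.0, -1.0]])

mean_score = np.nanmean(scores, axis=1)                    # mean user score per candidate

# Scores of 1 and 3 both count as a vote for "supernova"; -1 counts against.
votes = np.where(np.isnan(scores), np.nan, scores >= 1)
vote_fraction = np.nanmean(votes, axis=1)
majority = vote_fraction > 0.5
```

The third candidate shows how the two baselines can disagree: a single confident score of 3 lifts the mean above zero, while the majority still votes "not supernova".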

The complete dataset contains many volunteers who have provided very few classifications, particularly for
positive examples, as there are 322 classifications of positive data points compared to 43941 "not supernova"
examples. Candidate images vary greatly in how difficult they are to classify, so volunteers who have classified
small numbers of positive examples may have seen only easy or difficult examples, leading us to infer biased
confusion matrices. Including large numbers of volunteers with little data will also affect our inference over true
labels and confusion matrices of other decision makers. Therefore, we perform inference over a subsample of the
data. Inferred parameters can be used to update the hyperparameters before running the algorithm again over other
data points. To infer confusion matrices accurately, we require sufficient numbers of examples for both positive and
negative classes. We therefore first select all volunteers who have classified at least 50 examples of each class, then
select all data points that have been classified by at least 10 such volunteers; we then include other volunteers who
have classified the selected data points. This process produced a data set of 963 examples with decisions produced
from 1705 users. We tested the imperfect decision combination methods using five-fold cross validation. The
dataset is divided randomly into five partitions, then the algorithm is run five times, each with a different partition
designated as the test data and the other partitions used as training data. In the test partition the true labels are
withheld from our algorithms and are used only to measure performance.
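The partitioning scheme can be sketched with NumPy alone. The function name `k_fold_partitions`, the stand-in labels and the use of -1 to mark a withheld label are assumptions for illustration; the real data set is not reproduced here.

```python
import numpy as np

def k_fold_partitions(n_points, rng, n_folds=5):
    """Randomly split point indices into folds and yield (train, test) index arrays."""
    folds = np.array_split(rng.permutation(n_points), n_folds)
    for f in range(n_folds):
        train = np.concatenate([folds[g] for g in range(n_folds) if g != f])
        yield train, folds[f]

rng = np.random.default_rng(0)
true_labels = rng.integers(0, 2, size=963)          # stand-in labels, not the real data
for train_idx, test_idx in k_fold_partitions(len(true_labels), rng):
    known = np.full(len(true_labels), -1)            # -1 marks a withheld label
    known[train_idx] = true_labels[train_idx]
    # A combination method would now be run on all classifier outputs together with
    # `known`, and scored only on the points in test_idx.
```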

Figure 2a shows the average Receiver-Operating Characteristic (ROC) curves [23] taken across all cross-validation
datasets for the mean score, weighted sum and VB-IBCC. Each point on the ROC curve corresponds
to a different threshold value; classifier output values above a given threshold are taken as positive classifications
and those below as negative. At each threshold value we calculate a true positive rate - the fraction of positive
candidate images correctly identified - and a false positive rate - the fraction of negative candidates incorrectly
classified as positive.

The ROC curve for VB-IBCC clearly outperforms the mean of scores by a large margin. Weighted sum achieves
a slight improvement on the mean by learning to discount base classifiers each time they make a mistake. The
performance of the majority voting methods and IBCC using Gibbs sampling is summarised by the area under the
ROC curve (AUC) in Table 2b. The AUC gives the probability that a randomly chosen positive instance is ranked
higher than a randomly chosen negative instance. Majority voting methods only produce one point on the ROC
curve between 0 and 1 as they convert the scores to votes (-1 becomes a negative vote, 1 and 3 become positive)
and produce binary outputs. These methods have similar results to the mean score approach, with the weighted
version performing slightly worse, perhaps because too much information is lost when converting scores to votes
to be able to learn base classifier weights correctly.
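Both the ROC construction and the probabilistic reading of the AUC can be reproduced in a few lines. The sketch below is illustrative and uses invented scores; `roc_curve_points` assumes the scores are distinct, while `auc` counts ties as half.

```python
import numpy as np

def roc_curve_points(scores, labels):
    """False and true positive rates as the threshold is lowered past each score in turn."""
    scores, labels = np.asarray(scores, dtype=float), np.asarray(labels)
    ranked = labels[np.argsort(-scores, kind="stable")]    # highest score first
    tpr = np.concatenate([[0.0], np.cumsum(ranked == 1) / np.sum(ranked == 1)])
    fpr = np.concatenate([[0.0], np.cumsum(ranked == 0) / np.sum(ranked == 0)])
    return fpr, tpr

def auc(scores, labels):
    """Probability that a random positive is ranked above a random negative (ties count half)."""
    scores, labels = np.asarray(scores, dtype=float), np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]                     # every positive-negative pair
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

example_scores = [0.9, 0.8, 0.3, 0.1]
example_labels = [1, 0, 1, 0]
fpr, tpr = roc_curve_points(example_scores, example_labels)
area = auc(example_scores, example_labels)
```

A method that outputs only binary votes yields a single interior point on this curve, which is why the voting baselines are summarised by AUC alone.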

With Gibbs-sampling IBCC we collected samples until the mean of the sample label values converged. Convergence
was assumed when the total absolute difference between mean sample labels of successive iterations did
not exceed 0.01 for 20 iterations. The mean time taken to run VB-IBCC to convergence was 13 seconds, while for
Gibbs sampling IBCC it was 349 seconds. As well as executing significantly faster, VB produces a better AUC
than Gibbs sampling with this dataset. Gibbs sampling was run to thousands of iterations with no change in
performance observed. Hence it is likely that the better performance of the approximate variational Bayes results
from the nature of this dataset; Gibbs sampling may provide better results with other applications but suffers from
higher computational costs.
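The stopping rule used for the Gibbs sampler can be written as a small helper. This is one reading of the criterion, assuming the running mean of the sampled labels is stored after every iteration; the function name `has_converged` is hypothetical.

```python
import numpy as np

def has_converged(mean_history, tol=0.01, patience=20):
    """Check whether the last `patience` changes in the mean labels are all within `tol`.

    mean_history: list of running-mean label vectors, one entry per Gibbs iteration.
    """
    if len(mean_history) <= patience:
        return False                                         # not enough iterations yet
    recent = np.asarray(mean_history[-(patience + 1):])
    changes = np.abs(np.diff(recent, axis=0)).sum(axis=1)    # total absolute change per step
    return bool(np.all(changes <= tol))
```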



5 Communities of Decision Makers Based on Confusion Matrices ($\pi$ Communities)

In this section we apply a recent community detection methodology to the problem of determining most likely 
groupings of base classifiers, the imperfect decision makers. Grouping decision makers allows us to observe the 
behaviours present in our pool of base classifiers and could influence how we allocate classification tasks or train 
base classifiers. Community detection is the process of clustering a "similarity" or "interaction" network, so that 
classifiers within a given group are more strongly connected to each other than the rest of the graph. Identifying 
overlapping communities in networks is a challenging task. In recent work [24] we presented a novel approach to
community detection that infers such latent groups in the network by treating communities as explanatory latent 
variables for the observed connections between nodes, so that the stronger the similarity between two decision 
makers, the more likely it is that they belong to the same community. Such latent grouping is extracted by an 
appropriate factorisation of the connectivity matrix, where the effective inner rank (number of communities) is 
inferred by placing shrinkage priors [25] on the elements of the factor matrices. The scheme has the advantage
of soft-partitioning solutions, assignment of node participation scores to communities, an intuitive foundation and 
computational efficiency. 

Figure 3: Prototypical confusion matrices for each of the five communities inferred using Bayesian social network
analysis. Each graph corresponds to the most central individual in a community, with bar height indicating probability
of producing a particular score for a candidate of the given true class.

We apply the approach described in [24] to a similarity matrix calculated over all the citizen scientists in our
study, based upon the expected values of each user's confusion matrix. Expectations are taken over the distributions
of the confusion matrices inferred using the variational Bayesian method in Section 3.2 and characterise the
behaviour of the base classifiers. Denoting $\mathbb{E}[\pi^{(i)}]$ as the $(3 \times 2)$ confusion matrix inferred
for user $i$, we may define a simple similarity measure between agents $m$ and $n$ as

$$ V_{m,n} = \exp\Big( -\sum_{j=1}^{J} \mathrm{HD}\big(\mathbb{E}[\pi_j^{(m)}], \mathbb{E}[\pi_j^{(n)}]\big) \Big), \qquad (32) $$

where $\mathrm{HD}$ is the Hellinger distance between two distributions, meaning that two agents who have very
similar confusion matrices will have high similarity. Since the confusion matrices are multinomial distributions,
Hellinger distance is calculated as:

$$ \mathrm{HD}\big(\mathbb{E}[\pi_j^{(m)}], \mathbb{E}[\pi_j^{(n)}]\big) = 1 - \sum_{l=1}^{L} \sqrt{\mathbb{E}[\pi_{jl}^{(m)}]\, \mathbb{E}[\pi_{jl}^{(n)}]}. \qquad (33) $$

As confusion matrices represent probability distributions, Hellinger distance is chosen as an established,
symmetrical measure of similarity between two probability distributions [14]. Taking the exponential of the negative
Hellinger distance converts the distance measure to a similarity measure with a maximum of 1, emphasising cases
of high similarity.
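The similarity computation can be sketched as follows. The code stores each expected confusion matrix with one row per true class and sums the row-wise distances before exponentiating; both the layout and the summation over rows are assumptions made for this illustration, as are the function names.

```python
import numpy as np

def hellinger(p, q):
    """Distance in the form of Equation (33): one minus the Bhattacharyya coefficient."""
    return 1.0 - np.sum(np.sqrt(p * q))

def similarity(conf_m, conf_n):
    """Similarity between two expected confusion matrices (rows are true classes).

    Row-wise distances are summed before exponentiating, which is an assumption here.
    """
    total = sum(hellinger(row_m, row_n) for row_m, row_n in zip(conf_m, conf_n))
    return np.exp(-total)

careful = np.array([[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]])      # invented confusion matrices
extreme = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
sim = similarity(careful, extreme)
```

Identical matrices give a similarity of exactly 1, and the value decays towards `exp(-J)` as the rows of the two matrices stop overlapping.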

Application of Bayesian community detection to the matrix V robustly gave rise to five distinct groupings 
of users. In Figure 3 we show the centroid confusion matrices associated with each of these groups of citizen
scientists. The centroids are the expected confusion matrices of the individuals with the highest node participation 
scores for each community. The labels indicate the "true" class (0 for "not supernova" or 1 for "supernova") and the 
preference for the three scores offered to each user by the Zooniverse questions (-1, 1 & 3). Group 1, for example, 
indicates users who are clear in their categorisation of "not supernova" (a score of -1) but who are less certain 
regarding the "possible supernova" and "likely supernova" categories (scores 1 & 3). Group 2 are "extremists" 
who use little of the middle score, but who confidently (and correctly) use scores of -1 and 3. By contrast group 
3 are users who almost always use score -1 ("not supernova") whatever objects they are presented with. Group 4 
almost never declare an object as "not supernova" (incorrectly) and, finally, group 5 consists of "non-committal" 
users who rarely assign a score of 3 to supernova objects, preferring the middle score ("possible supernova"). It is 
interesting to note that all five groups have similar numbers of members (several hundred) but clearly each group 
indicates a very different approach to decision making. 



6 Common Task Communities 

In this section we examine groups of decision makers that have completed classification tasks for similar sets of
objects, which we label common task communities. Below we outline how these communities and the corresponding
confusion matrices could inform the way we allocate tasks and train decision makers. Intelligent task assignment
could improve our knowledge of confusion matrices, increase the independence of base classifiers selected for a
task, and satisfy human agents who prefer to work on certain types of task. We apply the overlapping community
detection method [24] to a co-occurrence network for the Galaxy Zoo Supernovae data. Edges connect citizen
scientists that have completed a common task, where edge weights $W_{mn}$ reflect the proportion of tasks common to
both individuals, such that

$$ W_{mn} = \frac{\text{number\_of\_common\_tasks}(m, n)}{0.5\,\big(N^{(m)} + N^{(n)}\big)}, \qquad (34) $$

where $N^{(k)}$ is the total number of observations seen by base classifier $k$. The normalisation term reduces the
weight of edges from decision makers that have completed large numbers of tasks, as these would otherwise have
very strong links to many others that they proportionally have little similarity to. The edge weights capture the
correlation between the tasks that individuals have completed and give the expectation that for a classifier label
$c_i^{(m)}$ chosen randomly from our sample, the classifier $n$ will also have produced a label $c_i^{(n)}$. It is
possible to place a prior distribution over these weights to provide a fully Bayesian estimate of the probability of
classifiers completing the same task. However, this would not affect the results of our community analysis method,
which uses single similarity values for each pair of nodes. For decision makers that have made few classifications,
edge weights may be poor estimates of similarity and thus introduce noise into the network. We therefore filter out
decision makers that have made fewer than 10 classifications.
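The edge weights and the filtering step can be sketched from a boolean agent-by-task matrix. The matrix layout and the function name `common_task_weights` are assumptions for illustration.

```python
import numpy as np

def common_task_weights(done, min_tasks=10):
    """Edge weights between agents from shared tasks.

    done: (n_agents, n_tasks) boolean matrix, True where an agent classified a task.
    Agents with fewer than `min_tasks` classifications are dropped.
    Returns the retained agent indices and their symmetric weight matrix.
    """
    done = np.asarray(done, dtype=float)
    keep = np.flatnonzero(done.sum(axis=1) >= min_tasks)
    done = done[keep]
    common = done @ done.T                         # number of tasks shared by each pair
    totals = done.sum(axis=1)                      # N^(k) for each retained agent
    weights = common / (0.5 * (totals[:, None] + totals[None, :]))
    np.fill_diagonal(weights, 0.0)                 # no self-loops
    return keep, weights
```

For two agents who each completed 10 tasks and share 5 of them, the weight is 5 / (0.5 * 20) = 0.5.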

The algorithm found 32 communities for 2131 citizen scientists and produced a strong community structure 
with modularity of 0.75. Modularity is a measure between -1 and 1 that assumes a strong community structure has 
more intra-community edges (edges that connect nodes in the same community) than inter-community edges. It is 
the fraction of intra-community edges minus the expected fraction of intra-community edges for a random graph 
with the same node degree distribution [26]. In Galaxy Zoo Supernovae, this very modular community structure 
may arise through users with similar preferences or times of availability being assigned to the same objects. Galaxy 
Zoo Supernovae prioritises the oldest objects that currently lack a sufficient number of classifications and 
assigns these to the next available citizen scientists. It also allows participants to reject tasks if desired. Possible 
reasons for rejecting a task are that the decision maker finds the task too difficult or uninteresting. Common task 
communities may therefore form where decision makers have similar abilities, preferences for particular tasks (e.g. 
due to interesting features in an image) or are available to work at similar times. When considering the choice of 
decision makers for a task, these communities could therefore inform who is likely to be available and who will 
complete a particular task. 
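
To make the modularity definition above concrete, the following sketch computes it for a weighted, undirected network. It assumes a hard partition in which each node belongs to exactly one community; the method used in this paper produces overlapping communities, so this is a simplified illustration rather than the computation behind the reported value. The function name and input layout are hypothetical.

```python
def modularity(edges, community_of):
    """Newman modularity Q of a weighted, undirected graph under a hard partition.

    edges: {(m, n): weight}, each undirected edge listed once, no self-loops.
    community_of: {node: community label} (assumed non-overlapping).
    """
    total = sum(edges.values())
    if total == 0:
        raise ValueError("graph has no edge weight")
    intra = {}    # edge weight falling inside each community
    degree = {}   # summed weighted degree of the nodes in each community
    for (m, n), w in edges.items():
        cm, cn = community_of[m], community_of[n]
        degree[cm] = degree.get(cm, 0.0) + w
        degree[cn] = degree.get(cn, 0.0) + w
        if cm == cn:
            intra[cm] = intra.get(cm, 0.0) + w
    # Observed intra-community fraction minus the fraction expected for a
    # random graph with the same degree distribution.
    return sum(intra.get(c, 0.0) / total - (d / (2.0 * total)) ** 2
               for c, d in degree.items())
```

Two disconnected pairs of nodes, each pair forming its own community, give Q = 0.5, while placing all four nodes in a single community gives Q = 0.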




Figure 4: Distribution of means of community members' confusion matrices for all common task communities. 
Proximity to a vertex indicates the probability of a score given an object with the stated true label class, e.g. in the 
graph labelled "Supernova", a point near the vertex labelled "score=-1" indicates a very high probability of giving 
a decision of -1 when presented with images of a genuine supernova. The left-hand plot shows the mean confusion 
matrices for t = 0, i.e. the class "not a supernova"; the right-hand plot shows the confusion matrices for t = 1 or 
"supernova". The size of the nodes indicates the number of members of the cluster. 

In Figure 4 we plot the distribution of the means of the community members' confusion matrices for each of 
the true classes. Differences between communities for $t = 0$ (the "not supernova" class) are less pronounced than for 
$t = 1$ (the "supernova" class). For the latter class we have 5134 observations as opposed to 48791 for the former, so 
that decision makers see fewer tasks with true label "supernova". This means that individual tasks with features 
that are more likely to elicit a certain base classifier response can have a greater effect on the confusion matrices 
learned. For instance, some tasks may be easier to classify than others or help a decision maker learn through 
the experience of completing the task, thus affecting the confusion matrices we infer. As we would expect, some 
smaller communities have more unusual means as they are more easily influenced by a single community member. 
The effect of this is demonstrated by the difference between community means in Figure 4 for groups of decision 
makers that have completed common sets of tasks. 

7 Dynamic Bayesian Classifier Combination 



In real-world applications such as Galaxy Zoo Supernovae, confusion matrices can change over time as the 
imperfect decision makers learn and modify their behaviour. We propose a dynamic variant of IBCC, DynIBCC, 
that models the change each time a decision maker performs a classification task. Using these dynamic confusion 
matrices we can also observe the effect of each observation on our distribution over a confusion matrix. 



In DynIBCC, we replace the simple update step for $\alpha_j$ given by Equation (22) with an update for every sample 
classified at time-steps denoted by $\tau$, giving time-dependent parameters $\alpha_{\tau,j}$. Figure 5 shows the graphical model 
for DynIBCC. As we detail in this section, the values of $\alpha_{\tau,j}$ are determined directly (rather than generated from 
a distribution) from the values of $\Pi$ for the previous and subsequent samples seen by each base classifier $k$. We 
use a dynamic generalised linear model [27], which enables us to iterate through the data updating $\alpha_\tau$ depending 
on the previous value $\alpha_{\tau-1}$. This is the forward pass, which operates according to Kalman filter update equations. 
We then use a Modified Bryson-Frazier smoother [28] to scroll backward through the data, updating $\alpha_\tau$ based on 
the subsequent value $\alpha_{\tau+1}$. The backward pass is an extension to the work in [29], where updates are dependent 
only on earlier values of $\alpha$. DynIBCC hence enables us to exploit a fully Bayesian model for dynamic classifier 
combination, placing distributions over $\pi_{\tau,j}$, while retaining computational tractability by using an approximate 
method to update the hyperparameters at each step. 

The base classifier $k$ may not classify the samples in the order given by their global indexes $i = 1, \ldots, N$, so we 
map global indexes to time-steps $\tau = 1, \ldots, T^{(k)}$ using 

$$\tau_i^{(k)} = f^{(k)}(i). \tag{35}$$

The mapping $f^{(k)}$ records the order in which $k$ classified items, with $\tau_i^{(k)}$ being the time-step at which sample $i$ was 
classified. For an object $i_{unseen}$ not classified by $k$, $f^{(k)}(i_{unseen}) = 0$. The inverse of $f^{(k)}$ is $i_\tau^{(k)} = f^{-1(k)}(\tau)$, 
a mapping from the time-step $\tau$ to the object $i$ that was classified at that time-step. 
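
The mapping between global indexes and per-classifier time-steps can be illustrated with a small Python sketch. The function names are hypothetical, and the sketch assumes the setting without duplicate classifications, so each object appears at most once in a classifier's history.

```python
def build_time_step_maps(classified_in_order):
    """Build the mapping f and its inverse for one base classifier k.

    classified_in_order: global object indexes i in the order k classified them.
    Returns (f, f_inv): f[i] is the time-step tau (counting from 1) at which
    object i was classified, and f_inv[tau] is the object classified at tau.
    """
    f, f_inv = {}, {}
    for tau, i in enumerate(classified_in_order, start=1):
        if i in f:
            # Assumption: no duplicate classifications in this simple variant.
            raise ValueError("duplicate classification of object %r" % i)
        f[i] = tau
        f_inv[tau] = i
    return f, f_inv


def time_step(f, i):
    """Return the time-step for object i, or 0 if k never classified it."""
    return f.get(i, 0)
```

For a classifier that saw objects 7, 2 and 9 in that order, object 2 maps to time-step 2, the inverse maps time-step 3 back to object 9, and any unseen object maps to 0.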

The dynamic generalised linear model allows us to estimate the probability of a classifier producing output $l$ 
given true label $t_{i_\tau}$: 

$$\tilde{\pi}_{\tau,l} = \pi_{\tau,t_{i_\tau},l} = p\left(c_{i_\tau} = l \mid t_{i_\tau}\right) \tag{36}$$

in which we omit the superscripts $(k)$ for clarity. We first specify our generalised linear model [29] by defining a 
basis function model with the form 

$$\tilde{\pi}_{\tau,l} = g\left(h_\tau^T w_{\tau,l}\right) \tag{37}$$

where $h_\tau$ is a binary input vector of size $J$, with $h_{\tau,t_{i_\tau}} = 1$ and all other values equal to zero, i.e. a binary vector 
representation of $t_{i_\tau} = j$. The function $g(.)$ is an activation function that maps the linear predictor $\eta_{\tau,l} = h_\tau^T w_{\tau,l}$ to 
the probability $\tilde{\pi}_{\tau,l}$ of base classifier response $l$. If we consider each possible classifier output $l$ separately, the value 
$\tilde{\pi}_{\tau,l}$ is the probability of producing output $l$ and can be seen as the parameter to a binomial distribution. Therefore 
$g(.)$ is the logistic function, hence 

$$\tilde{\pi}_{\tau,l} = \frac{\exp(\eta_{\tau,l})}{1 + \exp(\eta_{\tau,l})} \tag{38}$$

and its inverse, the canonical link function, is the logit function: 

$$\eta_{\tau,l} = \mathrm{logit}(\tilde{\pi}_{\tau,l}) = \log\left(\frac{\tilde{\pi}_{\tau,l}}{1 - \tilde{\pi}_{\tau,l}}\right) \tag{39}$$
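
As a small illustration of the activation and link functions in Equations (38) and (39), the sketch below implements both, together with the linear predictor for a one-hot input vector. The function names are hypothetical and the code is not part of the original model description.

```python
import math


def logistic(eta):
    """Activation g: maps a linear predictor to a probability (Equation 38)."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)  # this branch avoids overflow for large negative eta
    return z / (1.0 + z)


def logit(p):
    """Canonical link: maps a probability in (0, 1) to a linear predictor (Equation 39)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def linear_predictor(h, w):
    """eta = h^T w; with a one-hot h this selects the weight for the true class."""
    return sum(h_j * w_j for h_j, w_j in zip(h, w))
```

With a one-hot `h`, the linear predictor simply picks out the state value for the current true class, which is why each row of the confusion matrix can be tracked through its own component of the state vector.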

In the dynamic generalised linear model [27, 29], we track changes to the distribution over $\tilde{\pi}_{\tau,l}$ over time by treating 
$w_{\tau,l}$ as a state variable that evolves according to a random walk 

$$w_{\tau,l} = w_{\tau-1,l} + v_{\tau,l} \tag{40}$$

where $v_{\tau,l}$ is the state noise vector that corresponds to the drift in the state variable over time. We assume that 
state noise has a distribution where only the first two moments are known, $v_{\tau,l} \sim (0, q_{\tau,l} I)$, where $I$ is the identity 
matrix. The state noise variance $q_{\tau,l}$ will be estimated from our distributions over $\tilde{\pi}_{\tau,l}$, as explained below. 

Figure 5: Graphical model for DynIBCC. The dashed arrows indicate dependencies to nodes at previous or 
subsequent time-steps. Solid black circular nodes are variables calculated deterministically from their predecessors. The 
shaded node represents observed values, circular nodes are variables with a distribution and square nodes are variables 
instantiated with point values. 

7.1 Prior Distributions over the State Variables 

Here we consider Bayesian inference over the state variable $w_{\tau,l}$. As we place a distribution over $\tilde{\pi}_{\tau,l}$ in our model, 
we also have a distribution over $w_{\tau,l}$. Our sequential inference method is approximate since we only estimate the 
mean and covariance over $w_{\tau,l}$ rather than the full form of the distribution. At time $\tau$, given observations up to 
time-step $\tau - 1$, the prior state mean at time $\tau$ is $\hat{w}_{\tau|\tau-1,l}$ and its prior covariance is $P_{\tau|\tau-1,l}$. These are related to 
the posterior mean and covariance from the previous time-step $\tau - 1$ by 

$$\hat{w}_{\tau|\tau-1,l} = \hat{w}_{\tau-1|\tau-1,l} \tag{41}$$
$$P_{\tau|\tau-1,l} = P_{\tau-1|\tau-1,l} + q_{\tau,l} I. \tag{42}$$

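We estimate the state noise variance $q_{\tau,l}$ as 

$$q_{\tau+1,l} = \max\left[u_{\tau|\tau,l} - u_{\tau|\tau-1,l},\, 0\right] + z_{\tau,l} \tag{43}$$

where $u_{\tau|\tau,l}$ and $u_{\tau|\tau-1,l}$ are the variances in the distribution over the classifier outputs $c$ after observing data up to 
time $\tau$ and $\tau - 1$ respectively, and $z_{\tau,l}$ is the uncertainty in the classifier outputs. For observations up to time-step 
$\upsilon$, we define $u_{\tau|\upsilon,l}$ as: 

$$u_{\tau|\upsilon,l} = \hat{\pi}_{\tau|\upsilon,l}\left(1 - \hat{\pi}_{\tau|\upsilon,l}\right) \tag{44}$$

where 

$$\hat{\pi}_{\tau|\upsilon,l} = E\left[\tilde{\pi}_{\tau,l} \mid c_{i_1}, \ldots, c_{i_\upsilon}, t, \alpha_0\right] = g\left(h_\tau^T \hat{w}_{\tau|\upsilon,l}\right). \tag{45}$$

When the classifier outputs are observed, $z_{\tau,l}$ is zero; when they are not observed, we use $\hat{\pi}_{\tau|\tau-1,l}$ as an estimate 
of the missing output, so $z_{\tau,l} = u_{\tau|\tau-1,l}$. 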
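
The state noise estimate of Equations (43) and (44) reduces to a few arithmetic steps once the expected probabilities before and after an observation are available. The following sketch assumes those two probabilities have already been computed through Equation (45); the function and argument names are hypothetical.

```python
def output_variance(pi_hat):
    """u = pi_hat * (1 - pi_hat), the variance over classifier outputs (Equation 44)."""
    return pi_hat * (1.0 - pi_hat)


def state_noise_variance(pi_post, pi_prior, observed=True):
    """Estimate the state noise variance q for the next time-step (Equation 43).

    pi_post:  expected probability after seeing data up to time-step tau.
    pi_prior: expected probability after seeing data up to time-step tau - 1.
    observed: False when the classifier output at tau is missing.
    """
    u_post = output_variance(pi_post)
    u_prior = output_variance(pi_prior)
    # z is zero for observed outputs, otherwise the prior variance is used.
    z = 0.0 if observed else u_prior
    return max(u_post - u_prior, 0.0) + z
```

The `max` clips the estimate at zero, so an observation that reduces the variance over outputs contributes no state noise, whereas one that increases it lets the state drift further at the next step.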

From Equations (41) and (42) we can specify the mean and variance of the prior distribution of $\eta_{\tau,l}$: 

$$\hat{\eta}_{\tau|\tau-1,l} = h_\tau^T \hat{w}_{\tau|\tau-1,l} \tag{46}$$
$$r_{\tau|\tau-1,l} = h_\tau^T P_{\tau|\tau-1,l} h_\tau \tag{47}$$
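
The prediction step defined by Equations (41), (42), (46) and (47) can be sketched as follows for a single output value $l$. The code uses plain nested lists for the covariance matrix to stay dependency-free; the function name and argument layout are illustrative assumptions.

```python
def predict_prior_moments(w_post, P_post, q, h):
    """One prediction step of the random-walk state model for one output value l.

    w_post: posterior state mean from time-step tau - 1 (list of length J).
    P_post: posterior state covariance from tau - 1 (J x J nested list).
    q:      state noise variance for time-step tau.
    h:      one-hot vector marking the true class at time-step tau.
    Returns (w_prior, P_prior, eta_prior, r_prior).
    """
    J = len(w_post)
    # Equation (41): the random walk leaves the mean unchanged.
    w_prior = list(w_post)
    # Equation (42): add q on the diagonal of the covariance.
    P_prior = [[P_post[a][b] + (q if a == b else 0.0) for b in range(J)]
               for a in range(J)]
    # Equation (46): prior mean of the linear predictor, h^T w.
    eta_prior = sum(h[a] * w_prior[a] for a in range(J))
    # Equation (47): prior variance of the linear predictor, h^T P h.
    r_prior = sum(h[a] * P_prior[a][b] * h[b]
                  for a in range(J) for b in range(J))
    return w_prior, P_prior, eta_prior, r_prior
```

Because `h` is one-hot, `eta_prior` and `r_prior` are simply the entry of the mean and the diagonal entry of the covariance that correspond to the current true class.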

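We can now use $\hat{\eta}_{\tau|\tau-1,l}$ and $r_{\tau|\tau-1,l}$ to estimate the parameters of the prior distribution over $\tilde{\pi}_{\tau,l}$ as follows. 
The dynamic generalised linear model allows the distribution of the output variable $c_{i_\tau}$ to have any exponential 
family distribution [14]. In DynIBCC, the discrete outputs have a multinomial distribution, which is a member 
of the exponential family, with the Dirichlet distribution as the conjugate prior. Therefore, DynIBCC places a 
Dirichlet prior over $\tilde{\pi}_\tau$ with hyperparameters $\tilde{\alpha}_\tau$ that are dependent on the true label $t_{i_\tau}$. If we consider a single 
classifier output $c_{i_\tau} = l$, then 

$$\tilde{\pi}_{\tau,l} \sim \mathrm{Beta}\left(\tilde{\alpha}_{\tau,l}, \tilde{\beta}_{\tau,l}\right), \tag{48}$$

where $\tilde{\beta}_{\tau,l} = \sum_{m=1, m \neq l}^{L} \tilde{\alpha}_{\tau,m}$ and $L$ is the number of possible base classifier output values. Since $\tilde{\pi}_\tau$ is related to 
$\eta_{\tau,l}$ by the logistic function (Equation (38)), we can write the full prior distribution over $\eta_{\tau,l}$ in terms of the same 
hyperparameters: 

$$p\left(\eta_{\tau,l} \mid c_{i_1}, \ldots, c_{i_{\tau-1}}, t_{i_1}, \ldots, t_{i_\tau}\right) = \frac{1}{B(\tilde{\alpha}_{\tau,l}, \tilde{\beta}_{\tau,l})} \frac{\exp(\eta_{\tau,l})^{\tilde{\alpha}_{\tau,l}}}{\left(1 + \exp(\eta_{\tau,l})\right)^{\tilde{\alpha}_{\tau,l} + \tilde{\beta}_{\tau,l}}} \tag{49}$$

where $B(a, b)$ is the beta function. This distribution is recognised as a beta distribution of the second kind [30]. We 
can approximate the moments of this prior distribution as follows: 

$$\begin{aligned}
\hat{\eta}_{\tau|\tau-1,l} = E\left[\eta_{\tau,l} \mid c_{i_1}, \ldots, c_{i_{\tau-1}}, t_{i_1}, \ldots, t_{i_\tau}\right] &= \Psi(\tilde{\alpha}_{\tau,l}) - \Psi(\tilde{\beta}_{\tau,l}) && (50) \\
&\approx \log\left(\frac{\tilde{\alpha}_{\tau,l}}{\tilde{\beta}_{\tau,l}}\right) && (51) \\
r_{\tau|\tau-1,l} = V\left[\eta_{\tau,l} \mid c_{i_1}, \ldots, c_{i_{\tau-1}}, t_{i_1}, \ldots, t_{i_\tau}\right] &= \Psi'(\tilde{\alpha}_{\tau,l}) + \Psi'(\tilde{\beta}_{\tau,l}) && (52) \\
&\approx \frac{1}{\tilde{\alpha}_{\tau,l}} + \frac{1}{\tilde{\beta}_{\tau,l}} && (53)
\end{aligned}$$

From these approximations we can calculate $\tilde{\alpha}_\tau$ and $\tilde{\beta}_\tau$ in terms of $\hat{\eta}_{\tau|\tau-1}$ and $r_{\tau|\tau-1}$: 

$$\tilde{\alpha}_{\tau,l} = \frac{1 + \exp(\hat{\eta}_{\tau|\tau-1,l})}{r_{\tau|\tau-1,l}} \tag{54}$$
$$\tilde{\beta}_{\tau,l} = \frac{1 + \exp(-\hat{\eta}_{\tau|\tau-1,l})}{r_{\tau|\tau-1,l}} \tag{55}$$

This gives us approximate hyperparameters for the prior distribution over $\tilde{\pi}_\tau$. 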
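
A short sketch can confirm that Equations (54) and (55) invert the approximations in Equations (51) and (53): converting pseudo-counts to predictor moments and back should recover the original values. The function names are hypothetical.

```python
import math


def beta_params_from_moments(eta, r):
    """Equations (54)-(55): pseudo-counts from the predictor mean eta and variance r."""
    if r <= 0:
        raise ValueError("variance r must be positive")
    alpha = (1.0 + math.exp(eta)) / r
    beta = (1.0 + math.exp(-eta)) / r
    return alpha, beta


def moments_from_beta_params(alpha, beta):
    """Equations (51) and (53): approximate mean and variance of the linear predictor."""
    return math.log(alpha / beta), 1.0 / alpha + 1.0 / beta
```

For example, pseudo-counts of 2 and 3 give a predictor mean of log(2/3) and a variance of 5/6, and applying `beta_params_from_moments` to those values returns 2 and 3 again up to floating-point error.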

7.2 Forward Pass Filtering Steps 

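The forward pass filtering steps update the distribution over $\tilde{\pi}_\tau$ given an observation of the base classifier output 
$c_{i_\tau}$ at time $\tau$. We calculate the posterior hyperparameters $\tilde{\alpha}_{\tau|\tau,l}$ and $\tilde{\beta}_{\tau|\tau,l}$ by adding to the prior parameters: 

$$\tilde{\alpha}_{\tau|\tau,l} = \tilde{\alpha}_{\tau,l} + \delta_{c_{i_\tau} l} \tag{56}$$
$$\tilde{\beta}_{\tau|\tau,l} = \tilde{\beta}_{\tau,l} + \left(1 - \delta_{c_{i_\tau} l}\right). \tag{57}$$

In this way we update the pseudo-counts of base classifier output values, as we did in the static IBCC model in 
Equation (22). The posterior mean and variance are then approximated by 

$$\hat{\eta}_{\tau|\tau,l} \approx \log\left(\frac{\tilde{\alpha}_{\tau|\tau,l}}{\tilde{\beta}_{\tau|\tau,l}}\right) \tag{58}$$
$$r_{\tau|\tau,l} \approx \frac{1}{\tilde{\alpha}_{\tau|\tau,l}} + \frac{1}{\tilde{\beta}_{\tau|\tau,l}}. \tag{59}$$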
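
The pseudo-count update and the resulting posterior moments of the linear predictor can be sketched for one output value $l$ as below. This covers only Equations (56) to (59); the state update that follows is not included, and the function name is hypothetical.

```python
import math


def forward_update(alpha_prior, beta_prior, observed_output, l):
    """Posterior pseudo-counts and predictor moments for output value l.

    alpha_prior, beta_prior: prior hyperparameters for output l at this time-step.
    observed_output: the label the base classifier actually produced.
    """
    delta = 1.0 if observed_output == l else 0.0
    alpha_post = alpha_prior + delta               # Equation (56)
    beta_post = beta_prior + (1.0 - delta)         # Equation (57)
    eta_post = math.log(alpha_post / beta_post)    # Equation (58)
    r_post = 1.0 / alpha_post + 1.0 / beta_post    # Equation (59)
    return alpha_post, beta_post, eta_post, r_post
```

Each observation adds exactly one pseudo-count, either to the count for $l$ or to the count for all other outputs, so the posterior variance of the predictor shrinks as evidence accumulates within a step.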

Then, we can apply an update to the mean and covariance of the state variable using linear Bayesian estimation, 
described in [31]: 

$$\hat{w}_{\tau|\tau,l} = \hat{w}_{\tau|\tau-1,l} + K_{\tau,l}\left(\hat{\eta}_{\tau|\tau,l} - \hat{\eta}_{\tau|\tau-1,l}\right) \tag{60}$$
$$P_{\tau|\tau,l} = \left(I - K_{\tau,l} h_\tau^T\right) P_{\tau|\tau-1,l} \left(1 - \frac{r_{\tau|\tau,l}}{r_{\tau|\tau-1,l}}\right) \tag{61}$$

where $K_{\tau,l}$, the equivalent of the optimal Kalman gain, is 

$$K_{\tau,l} = \frac{P_{\tau|\tau-1,l}^T h_\tau}{r_{\tau|\tau-1,l}} \tag{62}$$

and $I$ is the identity matrix. The term $r_{\tau|\tau,l} / r_{\tau|\tau-1,l}$ in the covariance update corresponds to our uncertainty over $\eta_{\tau,l}$, 
which we do not observe directly. Linear Bayes estimation gives an optimal estimate when the full distribution 
over the state variable $w_{\tau,l}$ is unknown, and therefore differs from a Kalman filter in not specifying a Gaussian 
distribution over $v_{\tau,l}$ in Equation (40). 

To perform the forward pass we iterate through the data: for each time-step we calculate the prior state moments 
using Equations (41) and (42), then update these to the posterior state moments $\hat{w}_{\tau|\tau}$ and $P_{\tau|\tau}$. The forward pass 
filtering operates in a sequential manner as posterior state moments from time-step $\tau - 1$ are used to calculate the 
prior moments for the subsequent time-step $\tau$. 



7.3 Backward Pass Smoothing Steps 

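After filtering through the data calculating $\hat{w}_{\tau|\tau,l}$ and $P_{\tau|\tau,l}$ we then run a backward pass to find the approximate 
posterior moments given all subsequent data points, $\hat{w}_{\tau|N,l}$ and $P_{\tau|N,l}$, and from these the posterior 
hyperparameters given all data, $\tilde{\alpha}_{\tau|N}$. The backward pass is a Modified Bryson-Frazier smoother [28], which updates the 
distribution using the adjoint state vector $\hat{\lambda}_{\tau,l}$ and adjoint covariance matrix $\hat{\Lambda}_{\tau,l}$ as follows: 

$$\hat{w}_{\tau|N,l} = \hat{w}_{\tau|\tau,l} - P_{\tau|\tau,l} \hat{\lambda}_{\tau,l} \tag{63}$$
$$P_{\tau|N,l} = P_{\tau|\tau,l} - P_{\tau|\tau,l} \hat{\Lambda}_{\tau,l} P_{\tau|\tau,l}. \tag{64}$$

In our dynamical system the state $w_\tau$ evolves according to Equation (40), so $\hat{\lambda}_\tau$ and $\hat{\Lambda}_\tau$ are defined recursively as 
the posterior updates from the subsequent step $\tau + 1$ given data from $\tau + 1$ to $N$. 

$$\begin{aligned}
\tilde{\lambda}_{\tau,l} &= -\frac{h_\tau}{r_{\tau|\tau-1,l}}\left(\hat{\eta}_{\tau|\tau,l} - \hat{\eta}_{\tau|\tau-1,l}\right) + \left(I - K_{\tau,l} h_\tau^T\right)^T \hat{\lambda}_{\tau,l} && (65) \\
\hat{\lambda}_{\tau,l} &= \tilde{\lambda}_{\tau+1,l} && (66) \\
\hat{\lambda}_{N} &= 0 && (67) \\
\tilde{\Lambda}_{\tau,l} &= \frac{h_\tau h_\tau^T}{r_{\tau|\tau-1,l}}\left(1 - \frac{r_{\tau|\tau,l}}{r_{\tau|\tau-1,l}}\right) + \left(I - K_{\tau,l} h_\tau^T\right)^T \hat{\Lambda}_{\tau,l} \left(I - K_{\tau,l} h_\tau^T\right) && (68) \\
\hat{\Lambda}_{\tau,l} &= \tilde{\Lambda}_{\tau+1,l} && (69) \\
\hat{\Lambda}_{N} &= 0 && (70)
\end{aligned}$$

Estimates for final posterior hyperparameters are therefore given by 

$$\begin{aligned}
\hat{\eta}_{\tau|N,l} &= h_\tau^T \hat{w}_{\tau|N,l} && (71) \\
r_{\tau|N,l} &= h_\tau^T P_{\tau|N,l} h_\tau && (72) \\
\tilde{\alpha}_{\tau|N,l} &= \frac{1 + \exp(\hat{\eta}_{\tau|N,l})}{r_{\tau|N,l}} && (73) \\
\tilde{\beta}_{\tau|N,l} &= \frac{1 + \exp(-\hat{\eta}_{\tau|N,l})}{r_{\tau|N,l}} && (74)
\end{aligned}$$

7.4 Variational Update Equations 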



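We can now replace the variational distribution for $q^*(\pi_j)$ given by Equation (21). We continue to omit the $(k)$ 
notation for clarity. The dynamic model instead uses a variational distribution for each time-step, $q^*(\pi_{\tau,j})$, given 
by 

$$\begin{aligned}
q^*(\pi_{\tau,j}) &= \frac{1}{B(\alpha_{\tau|N,j})} \prod_{l=1}^{L} \pi_{\tau,jl}^{\alpha_{\tau|N,jl} - 1} && (75) \\
&= \mathrm{Dir}\left(\pi_{\tau,j} \mid \alpha_{\tau|N,j1}, \ldots, \alpha_{\tau|N,jL}\right) && (76)
\end{aligned}$$

where Dir() is the Dirichlet distribution with parameters $\alpha_{\tau|N,j}$ calculated according to 

$$\alpha_{\tau|N,jl} = \frac{1 + \exp(\hat{w}_{\tau|N,jl})}{P_{\tau|N,jjl}} \tag{77}$$

We calculate $\hat{w}_{\tau|N}$ and $P_{\tau|N}$ using the above filtering and smoothing passes, taking the expectation over $t$ so we 
replace $h_{\tau,j}$ with $\tilde{h}_{\tau,j} = E_t[t_{i_\tau} = j]$. Equation (77) is used to derive the hyperparameters for each row of the 
confusion matrix and thus each possible value of $t_{i_\tau}$; thus it is equivalent to Equation (74) with $h_{\tau,j} = 1$. This 
update equation replaces Equation (22) in the static model. The expectation given by Equation (23) becomes 

$$E\left[\ln \pi_{\tau,jl}\right] = \Psi\left(\alpha_{\tau|N,jl}\right) - \Psi\left(\sum_{m=1}^{L} \alpha_{\tau|N,jm}\right) \tag{78}$$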
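
The expectation of the log confusion-matrix entries only requires the digamma function $\Psi$. Since the Python standard library does not provide one, the sketch below uses a common approximation (an upward recurrence followed by an asymptotic series); this numerical scheme is an assumption of the example, not something prescribed by the model, and the function names are hypothetical.

```python
import math


def digamma(x):
    """Approximate the digamma function for x > 0.

    Uses psi(x) = psi(x + 1) - 1/x to shift x above 6, then an asymptotic series.
    """
    if x <= 0:
        raise ValueError("x must be positive")
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # ln x - 1/(2x) - 1/(12 x^2) + 1/(120 x^4) - 1/(252 x^6)
    return (result + math.log(x) - 0.5 / x
            - inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0)))


def expected_log_pi(alpha_row):
    """E[ln pi_l] = psi(alpha_l) - psi(sum_m alpha_m) for one confusion-matrix row."""
    total = digamma(sum(alpha_row))
    return [digamma(a) - total for a in alpha_row]
```

For a row with hyperparameters (1, 1), both expectations equal $\Psi(1) - \Psi(2) = -1$, which gives a quick sanity check on the approximation.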



This can then be used in the variational distribution over $t_i$ as follows, replacing Equation (5): 

$$\ln q^*(t_i) = E\left[\ln \kappa_{t_i}\right] + \sum_{k=1}^{K} E\left[\ln \pi^{(k)}_{\tau_i^{(k)}, t_i, c_i^{(k)}}\right] + \mathrm{const}. \tag{79}$$

7.5 DynIBCC Joint and Posterior Distributions 

In DynIBCC, we altered the IBCC model to use the time-dependent confusion matrices, giving the joint distribution 
over all latent variables and parameters in the model as follows. Here we are considering distributions for all base 
classifiers and therefore must re-introduce the $k$ superscript notation. 

$$p(\kappa, \Pi, t, c \mid A_0, \nu_0) = \prod_{i=1}^{N} \left\{ \kappa_{t_i} \prod_{k \in C_i} \pi^{(k)}_{\tau_i^{(k)}, t_i, c_i^{(k)}} \, p\left(\pi^{(k)}_{\tau_i^{(k)}} \mid \pi^{(k)}_{\tau_i^{(k)} - 1}, \alpha_0^{(k)}\right) \right\} p(\kappa \mid \nu_0) \tag{80}$$

where $C_i$ is the set of base classifiers that have completed classification task $i$. There is a change of notation 
from Equation (1) for the static IBCC model, which iterates over all $K$ classifiers. Here we iterate over the set $C_i$ 
because $\tau_i^{(k)}$ is undefined for objects that have not been classified by $k$. The static model (Equation (1)) does not 
have this issue as the confusion matrix is the same for all tasks, and thus Equation (1) defines a joint probability 
over all observed and unobserved base classifier outputs. In DynIBCC, if we wish to determine a distribution over 
an unobserved base classifier output $c_{i_{unseen}}^{(k)}$ we must also determine a suitable confusion matrix by determining 
which time-step the unseen classifier output occurred at. 

In Equation (80) above, the prior over $\pi^{(k)}_{\tau_i}$ can be estimated by finding the mean $\bar{\eta}_{\tau|\tau-1,jl}$ and variance 
$\bar{r}_{\tau|\tau-1,jl}$ of the linear predictor from its value at the previous time-step, given by: 

$$\eta_{\tau-1,jl} = \ln\left(\frac{\pi_{\tau-1,jl}}{\sum_{m=1, m \neq l}^{L} \pi_{\tau-1,jm}}\right) \tag{81}$$

The bar notation $\bar{\eta}$ indicates a variable that is calculated deterministically from the previous state given the value of 
$\pi_\tau^{(k)}$. Considering the random walk Equation (40), where the change $v_{\tau,l}$ from the previous state has mean 0 and 
covariance $q_{\tau,l} I$, the moments of the linear predictor are 

$$\bar{\eta}_{\tau|\tau-1,jl} = \eta_{\tau-1,jl} \tag{82}$$
$$\bar{r}_{\tau|\tau-1,jl} = q_{\tau,jl} \tag{83}$$

where $q_{\tau,l}$ is estimated as per Equation (43). From these values we can calculate the parameters $\alpha_{\tau,jl}$ for a 
Dirichlet prior over $\pi_\tau^{(k)}$: 

$$p\left(\pi^{(k)}_{\tau,j} \mid \pi^{(k)}_{\tau-1,j}, \alpha^{(k)}_{0,j}\right) = \mathrm{Dir}\left(\pi^{(k)}_{\tau,j} \mid \alpha^{(k)}_{\tau,j1}, \ldots, \alpha^{(k)}_{\tau,jL}\right). \tag{84}$$

For $\tau^{(k)} = 1$ the parameter $\alpha_{1,jl} = \alpha_{0,jl}$. For $\tau^{(k)} > 1$ it is given by: 

$$\alpha_{\tau,jl} = \frac{1 + \exp(\bar{\eta}_{\tau|\tau-1,jl})}{\bar{r}_{\tau|\tau-1,jl}} \tag{85}$$

7.6 Duplicate Classifications 

The original static model did not allow for duplicate classifications of the same object by the same base classifier. 
We assumed that even if a base classifier alters their decision when they see an object a second time, the two 
decisions are likely to be highly correlated and so cannot be treated as independent. However, the dynamic model 
reflects the possibility that the base classifier may change its own underlying model; therefore responses may be 
uncorrelated if they are separated by a sufficient number of time-steps or if the confusion matrix changes rapidly 
over a small number of time-steps. A model that handles dependencies between duplicate classifications at time 
$\tau_{original}$ and time $\tau_{duplicate}$ may adjust $\pi^{(k)}_{\tau_{original}}$ and $\pi^{(k)}_{\tau_{duplicate}}$ to compensate for correlation. However, in 
applications where duplicates only occur if they are separated by a large number of time-steps it may be reasonable 
to treat them as independent observations. In cases where duplicates are allowed we index decisions by their 
time-step as $c_\tau^{(k)}$. For model variants that permit duplicates the joint distribution is hence: 

$$p(\kappa, \Pi, t, c \mid A_0, \nu_0) = \prod_{i=1}^{N} \left\{ \kappa_{t_i} \prod_{k \in C_i} \prod_{\tau \in f^{(k)}(i)} \pi^{(k)}_{\tau, t_i, c_\tau^{(k)}} \, p\left(\pi^{(k)}_{\tau} \mid \pi^{(k)}_{\tau - 1}, \alpha_0^{(k)}\right) \right\} p(\kappa \mid \nu_0) \tag{86}$$

where as before $f^{(k)}(i)$ maps an object $i$ to the time-step at which $i$ was classified by base classifier $k$. For a 
sample $i_{unseen}$ not classified by $k$, $f^{(k)}(i_{unseen}) = 0$. 

Figure 6: Graphical model for DynIBCC allowing multiple classifications of the same object by the same classifier 
(duplicate classifications). The dashed arrows indicate dependencies to nodes at previous or subsequent time-steps. 
Solid black circular nodes are variables calculated deterministically from their predecessors. The shaded node 
represents observed values, circular nodes are variables with a distribution and square nodes are variables instantiated with 
point values. 

We must also update Equation (79) to allow duplicates as follows: 

$$\ln q^*(t_i) = E\left[\ln \kappa_{t_i}\right] + \sum_{k \in C_i} \sum_{\tau \in f^{(k)}(i)} E\left[\ln \pi^{(k)}_{\tau, t_i, c_\tau^{(k)}}\right] + \mathrm{const}. \tag{87}$$

The resulting graphical model is shown in Figure 6 with an additional plate to allow different time-steps $\tau$ that 
correspond to the same base classifier $k$ and object $i$. 






7.7 Variational Lower Bound 



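We now give the variational lower bound for Dynamic IBCC using the formulation that permits duplicates. We use 
$\Pi = \left\{\pi^{(k)}_{\tau,j} \mid \tau = 1, \ldots, T^{(k)},\ j = 1, \ldots, J,\ k = 1, \ldots, K\right\}$ to refer to the set of confusion matrices for all classifiers and 
all time-steps. 

$$\begin{aligned}
L(q) &= \iiint q(t, \Pi, \kappa) \ln \frac{p(c, t, \Pi, \kappa \mid A_0, \nu_0)}{q(t, \Pi, \kappa)} \, dt \, d\Pi \, d\kappa \\
&= E_{t,\Pi,\kappa}\left[\ln p(c, t, \Pi, \kappa \mid A_0, \nu_0)\right] - E_{t,\Pi,\kappa}\left[\ln q(t, \Pi, \kappa)\right] \\
&= E_{t,\Pi}\left[\ln p(c \mid t, \Pi)\right] + E_{\Pi}\left[\ln p(\Pi \mid A_0)\right] + E_{t,\kappa}\left[\ln p(t \mid \kappa)\right] + E_{\kappa}\left[\ln p(\kappa \mid \nu_0)\right] \\
&\quad - E_{t,\Pi,\kappa}\left[\ln q(t)\right] - E_{t,\Pi}\left[\ln q(\Pi)\right] - E_{t,\kappa}\left[\ln q(\kappa)\right] && (88)
\end{aligned}$$

The expectation terms relating to the joint probability of the latent variables, observed variables and the 
parameters are as for the static model in Subsection 3.3 except 

$$\begin{aligned}
E_{t,\Pi}\left[\ln p(c \mid t, \Pi)\right] &= \sum_{i=1}^{N} \sum_{k \in C_i} \sum_{\tau \in f^{(k)}(i)} \sum_{j=1}^{J} E[t_i = j] \, E\left[\ln \pi^{(k)}_{\tau, j, c_\tau^{(k)}}\right] && (89) \\
E_{\Pi}\left[\ln p(\Pi \mid A_0)\right] &= \sum_{k=1}^{K} \sum_{j=1}^{J} \sum_{\tau=1}^{T^{(k)}} \Bigg\{ -\ln B\left(\alpha^{(k)}_{\tau,j}\right) && (90) \\
&\qquad + \sum_{l=1}^{L} \left(\alpha^{(k)}_{\tau,jl} - 1\right) E\left[\ln \pi^{(k)}_{\tau,jl}\right] \Bigg\} && (91)
\end{aligned}$$

In DynIBCC, the expectation over the variational distribution $q^*(\Pi)$ also differs from static IBCC: 

$$\begin{aligned}
E_{t,\Pi}\left[\ln q(\Pi)\right] &= \sum_{k=1}^{K} \sum_{j=1}^{J} \sum_{\tau=1}^{T^{(k)}} \Bigg\{ -\ln B\left(\alpha^{(k)}_{\tau|N,j}\right) && (92) \\
&\qquad + \sum_{l=1}^{L} \left(\alpha^{(k)}_{\tau|N,jl} - 1\right) E\left[\ln \pi^{(k)}_{\tau,jl}\right] \Bigg\} && (93)
\end{aligned}$$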

8 Dynamics of Galaxy Zoo Supernovae Contributors 

We applied the variational Bayesian DynIBCC to the Galaxy Zoo Supernovae data from Section 4 to examine the 
changes to individual confusion matrices. There is a large variation in the dynamics of different decision makers, 
but in many there are sustained drifts in a particular direction. We give examples of the different types of changes 
found for different base classifiers (the Galaxy Zoo Supernovae volunteers) in Figures 7 and 8. These are 
ternary plots of the expected confusion matrices at each time-step or observation of the decision maker. Each line 
corresponds to the confusion vector $\pi_j^{(k)}$ for true class $j$. To help the reader time-align the traces, certain time-steps 
have been labelled with a blue marker and an edge between the two confusion vectors, with the label denoting the 
global number of observations for all base classifiers at this point. The example volunteers classified 29651, 21933, 
23920 and 20869 candidates respectively. The first example is Figure 7a, which shows a confusion matrix with 
some drift in the earlier time-steps and some small fluctuations later on for the "not supernova" class. The decision 
maker shown in Figure 7b appears to have a more sustained drift away from scores of 3 for both true classes. The 
earliest changes in both these decision makers, such as for the "supernova" class in Figure 7b, appear to be a move 
away from the prior, which affects the first data points most. The last two examples show more sudden changes. 
In Figure 8a we see a very significant change in the later observations for the "not supernova" class, after which 
the confusion matrix returns to a similar point to before. Figure 8b shows little initial change followed by a sudden 
change for the "not supernova" class, which then becomes fairly stable. The dynamics observed were inferred over 
a large number of data points, suggesting that the longer trends are due to genuine changes in performance of base 
classifiers over time. Smaller fluctuations may be due to bias in the observations (e.g. for a task that is very easy) or 
a change in behaviour of the citizen scientists, but sustained changes after the initial move away from the priors are 
more suggestive of a change in behaviour or in the information presented to agents when making classifications. 
For all four examples, there are more initial fluctuations, which could relate to the way that new volunteers adapt 
when they complete their first tasks. 

Figure 7: Ternary plot showing the dynamics for Galaxy Zoo Supernovae volunteers: (a) Volunteer ID 79142, 
(b) Volunteer ID 142372. Each line plots the evolution of a row of the confusion matrix corresponding to a particular 
true class. Proximity of the line to a vertex indicates the probability of generating a certain score for a candidate 
object with the given true class. Blue markers help the reader align points on the two lines, with a label indicating 
the global number of observations at that point. 

Figure 8: Ternary plot showing the dynamics for Galaxy Zoo Supernovae volunteers. Each line plots the evolution 
of a row of the confusion matrix corresponding to a particular true class. Proximity of the line to a vertex indicates 
the probability of generating a certain score for a candidate object with the given true class. Blue markers help the 
reader align points on the two lines, with a label indicating the global number of observations at that point. 



9 Dynamics of π Communities 

We now apply the community analysis method used in Section 5 to the dynamic confusion matrices to examine 
the development of the community structure over time. After different numbers of observations $s$ we run the 
community detection method [24] over an adjacency matrix, using Equation (32) with the most recent confusion 
matrices for all base classifiers observed up to $s$. 




Figure 9: π communities: means over the expected confusion matrices of community members after different numbers 
of observations: (a) 3000 observations, (b) 12000 observations, (c) 26558 observations. At each time point we see a 
new community appear while previous communities persist with similar means. 

In Figure 9 we see how the same communities emerge over time as we saw in Section 5 in Figure 3. Initially, 
only three communities are present, with those corresponding to groups 4 ("optimists") and 1 ("reasonable") in 
Figure 3 only appearing after 12000 and 26558 observations respectively. The "reasonable" group is the last to emerge and 
most closely reflects the way the designers of the system intend good decision makers to behave. It may therefore 
appear as a result of participants learning, or of modifications to the user interface or instructions as the Galaxy Zoo 
Supernovae application was being developed to encourage this behaviour. 

We also note that agents switch between communities. In Figure 10 we show the node participation scores at 
each number of observations $s$ for the individuals we examined in Section 8. Community membership changes after 
significant changes to the individual's confusion matrix. However, the communities persist despite the movement 
of members between them. 

10 Dynamics of Common Task Communities 

Finally, we look at the evolution of the common task communities to observe the effect of recent tasks on the 
community structure and confusion matrices. We wish to observe whether distinct communities are persistent as 
more tasks are completed. Changes to the structure inferred may be a result of observing more data about the base 
classifiers. Alternatively, individual behaviours may evolve as a result of learning new types of tasks or changes to 
individual circumstances, such as when a volunteer is available to carry out tasks. Our choice of community analysis 
method, given in Section 5, has the advantage that only a maximum number of communities need be chosen by the 
programmer, with the algorithm finding the most likely number of communities from the network itself. Here 
we show the changes that occur in Galaxy Zoo Supernovae. We generated three co-occurrence networks from 
all tasks completed up to 50,000, 200,000 and 493,048 observations. As before, we remove base classifiers with 
fewer than 10 classifications to filter out edges that may constitute noise rather than significant similarity. The 
algorithm produced community structures with modularities of 0.67, 0.69 and 0.75 respectively, showing that good 
community structure is present for smaller periods of observations (see Section 5 for definition of modularity). 
Figures 11, 12 and 13 show the means of the community members at each of these time slices, weighted by 
node participation. Since DynIBCC models the dynamics of base classifier confusion matrices as a random walk, 
the observations closest to the current time have the strongest effect on the distribution over the confusion matrices. 
Therefore, the expected confusion matrices can readily be used to characterise a community at a given point in 
time. When calculating the means we use the expected confusion matrix from the most recent time-step for that 
network. 

Figure 10: Node participation scores for the π communities for selected individuals shown in Section 8 after different 
numbers of observations (panels: Base Classifier 79142, 139963, 142372 and 259297). Each bar corresponds to the 
individual's participation score in a particular community after running the community analysis over all observations 
up to that point. Participation scores close to one indicate very strong membership of a community. The node 
participation scores for one number of observations sum to one over the communities 1 to 5. 

In all three networks there is a persistent core for both true classes, where the means for the large communities 
remain similar. Some communities within this group move a small amount, for example, the large red community 
in the "Supernova" class. In contrast, we see more scattered small communities appear after 200,000 observations 
and at 493,048 observations. It is possible that the increase in number of base classifiers as we see more data 
means that previous individual outliers are now able to form communities with similar outliers. Therefore outlying 
communities could be hard to detect with smaller datasets. Many of these appear in the same place in only one 
of the figures, suggesting that they may contain new base classifiers that have made few classifications up to that 
point. Some are less transient however: the top-most community in the "not supernova" class in Figures 12 
and 13 moves only a small amount. Similar sets of tasks may produce more extreme confusion matrices such 
as these for different agents at different times, implying that these tasks induce a particular bias in the confusion 
matrices. The changes we observe in Figures 11, 12 and 13 demonstrate how we can begin to identify the 
effect of different tasks on our view of the base classifiers by evaluating changes to the community structure after 
classifying certain objects. Future investigations may consider the need to modify the co-occurrence network to 
discount older task-based classifier associations. 

Figure 11: Common task communities: ternary plot of means of expected confusion matrices for community members 
after 50,000 observations. Each point corresponds to one common task community and represents the mean of $E[\pi_j]$ 
for the community members. Proximity of a point to a vertex indicates the probability of outputting a particular score 
when presented with an object of true class "supernova" or "not supernova". 

Figure 12: Common task communities: ternary plot of means of expected confusion matrices for community members 
after 200,000 observations. Each point corresponds to one common task community and represents the mean of $E[\pi_j]$ 
for the community members. Proximity of a point to a vertex indicates the probability of outputting a particular score 
when presented with an object of true class "supernova" or "not supernova". 

Figure 13: Common task communities: ternary plot of means of expected confusion matrices for community members 
after 493,048 observations. Each point corresponds to one common task community and represents the mean of $E[\pi_j]$ 
for the community members. Proximity of a point to a vertex indicates the probability of outputting a particular score 
when presented with an object of true class "supernova" or "not supernova". 



11 Discussion 

In this paper we present a very computationally efficient variational Bayesian approach to imperfect multiple 
classifier combination. We evaluated the method using real data from the Galaxy Zoo Supernovae citizen sci- 
ence project, with 963 classification tasks, 1705 base classifiers and 26,558 observations. In our experiments, our 
method far outperformed all other methods, including weighted sum and weighted majority, both of which are of- 
ten advocated as they also learn weightings for the base classifiers. For our variational Bayes method the required 
computational overheads were far lower than those of Gibbs sampling approaches, giving much shorter compute 
time, which is particularly important for applications that need to make regular updates as new data is observed, 
such as our application here. Furthermore, on this data set at least, the accuracy of predictions was also better than 
the slower sampling-based method. We have shown that social network analysis can be used to extract sensible 
structure from the pool of decision makers using information inferred by Bayesian classifier combination or task 
co-occurrence networks. This structure provides a useful grouping of individuals and gives valuable information 
about their decision-making behaviour. We extended our model to allow for on-line dynamic analysis and showed 
how this enables us to track the changes in time associated with individual base classifiers. We also demonstrated 
how the community structures change over time, showing the use of the dynamic model to update information 
about group members. 
Our current work considers how the rich information learned using our models can be exploited to improve the
base classifiers, namely the human volunteer users. For example, we can use the confusion matrices, $\Pi$, and the
community structure to identify users who would benefit from more training. This could take place through inter- 
action with user groups who perform more accurate decision making, for example via extensions of apprenticeship 
learning [32]. We also consider ways of producing user specialisation via selective object presentation such that
the overall performance of the human-agent collective is maximised. We note that this latter concept bears the hallmarks
of computational mechanism design [33], and the incorporation of incentive engineering and coordination
mechanisms into the model is one of our present challenges. Future work will also investigate selecting individuals 
for a task to maximise both our knowledge of the true labels and of the confusion matrices, for example, by looking 
at the effects of previous tasks on the confusion matrices. To bring these different aspects together, we consider 
a global utility function for a set of classification and training tasks indexed $i = 1, \ldots, N$, assigned to a set of base
classifiers $k = 1, \ldots, K$. Classifiers assigned to object $i$ are part of a coalition $C_i$, chosen to maximise the total expected value
of these assignments:

$$V(C_1, \ldots, C_N) = \sum_{i=1}^{N} \left[ V_{\mathrm{object}}(C_i) + V_{\mathrm{dm}}(C_i) + V_{\mathrm{cost}}(C_i) \right] \qquad (94)$$

where $V_{\mathrm{object}}(C_i)$ is the expected information gain about the true class of object $i$ from the classifiers in $C_i$,
$V_{\mathrm{dm}}(C_i)$ is the improvement to the decision makers through this assignment and $V_{\mathrm{cost}}(C_i)$ captures other costs,
such as payments to a decision maker. The value $V_{\mathrm{object}}(C_i)$ should be higher for classifiers in $C_i$ that are independent,
so coalitions of decision makers from different communities may be favoured, as different experience and
confusion matrices may indicate correlation is less likely. $V_{\mathrm{object}}(C_i)$ should also account for specialisations, for
example, by members of the same common task community. $V_{\mathrm{dm}}(C_i)$ captures expected changes to confusion
matrices that result from the combiner learning more about base classifiers and from base classifiers improving
through training or experience. In Galaxy Zoo Supernovae, for example, the contributors are attempting to identify
objects visually from a textual description. The description may leave some ambiguity, e.g. "is the candidate
roughly centred". Seeing a range of images may alter how "roughly" the candidate can be centred before the contributor
answers "yes". Thus the value $V_{\mathrm{dm}}(C_i)$ will depend on the objects previously classified by classifier $k$. A
key direction for future work is defining these values so that systems such as Galaxy Zoo Supernovae can feed back 
information from confusion matrices and community structure to improve the overall performance and efficiency of 
the pool of decision makers. Common task communities and $\pi$ communities may play a central role in estimating
the effects of task assignments and training on related individuals. They could also be exploited to reduce the size
of the task assignment problem to one of choosing classifiers from a small number of groups rather than evaluating 
each classifier individually. 
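
As a rough illustration of how the utility in Equation (94) might be evaluated once its terms are defined, the sketch below sums three value functions over the coalitions assigned to each task. The paper leaves the value terms undefined, so the three functions here are placeholders invented for this example: the object term simply counts distinct communities in a coalition, the decision-maker term is zero, and the cost term is a fixed negative amount per classifier (costs are entered as negative values because the equation adds all three terms). The community labels are also hypothetical.

```python
def total_utility(coalitions, v_object, v_dm, v_cost):
    """Sum the three value terms over the coalition assigned to each task."""
    return sum(v_object(c) + v_dm(c) + v_cost(c) for c in coalitions)


# Hypothetical community membership for four base classifiers.
community = {"k1": "A", "k2": "A", "k3": "B", "k4": "C"}


def v_object(coalition):
    # Placeholder: reward coalitions whose members come from distinct
    # communities, as a crude stand-in for independence.
    return float(len({community[k] for k in coalition}))


def v_dm(coalition):
    # Placeholder: no training or learning benefit is modelled here.
    return 0.0


def v_cost(coalition):
    # Placeholder: fixed cost per classifier, expressed as a negative value.
    return -0.25 * len(coalition)


# One task, coalition drawn from a single community versus two communities.
same_community = total_utility([("k1", "k2")], v_object, v_dm, v_cost)
mixed_community = total_utility([("k1", "k3")], v_object, v_dm, v_cost)

# Two tasks, each assigned a mixed coalition.
two_tasks = total_utility([("k1", "k3"), ("k2", "k4")], v_object, v_dm, v_cost)
```

Under these placeholder values the mixed coalition scores higher than the single-community one, which mirrors the preference for independent classifiers described above. Because the placeholder object term depends only on community labels, candidate coalitions could be enumerated over communities rather than over individual classifiers, which is the kind of reduction in problem size suggested in the text.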